Hi! I'm new to bioinformatics and am working with some FASTQ files that have strange base quality distributions (see image). We see only 4 unique Phred scores across the whole file, which is surprising given that Illumina sequencing has 41 possible scores. This happens across multiple files, and the files are straight from the sequencing company. The four quality characters we see are "F", ",", ":" and "#".
I have confirmed this behaviour with multiple people in my team, so this is not an analysis problem; it is an issue with the files themselves (it is also obvious when looking at the Phred scores of the raw reads).
I also found this other question on Biostars. It's hard to tell, but they seem to see the same behaviour, which suggests it may be a common issue. Does anyone have any idea what is happening? The reads themselves seem normal when compared to the reference genome.
We have contacted the sequencing company, but they haven't really provided clarity, so I thought that maybe people here could provide some insight.
Thanks in advance!
While this is not going to be needed, the BBMap suite offers a tool that will allow you to recalibrate the Q scores based on alignments. The tool is called calctruequality.sh.
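For reference, here is a minimal sketch of how the four quality characters from the question map to numeric Phred scores, and how one could count the distinct quality characters in a FASTQ file. It assumes the standard Phred+33 encoding and a strict four-lines-per-record layout. The function names and the two-read example are made up for illustration; they are not real data and not part of BBMap.

```python
from collections import Counter
import io

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def decode_phred33(char):
    """Convert one FASTQ quality character to its numeric Phred score."""
    return ord(char) - PHRED_OFFSET


def quality_counts(handle):
    """Count quality characters in a four-lines-per-record FASTQ stream."""
    counts = Counter()
    for i, line in enumerate(handle):
        if i % 4 == 3:  # every fourth line holds the quality string
            counts.update(line.rstrip("\n"))
    return counts


# Hypothetical two-read example, not taken from the files in the question.
example = io.StringIO("@r1\nACGT\n+\nFF:,\n@r2\nACGT\n+\nF#FF\n")
counts = quality_counts(example)

# Print each distinct character, its Phred score, and how often it occurs.
for ch, n in sorted(counts.items(), key=lambda kv: decode_phred33(kv[0])):
    print(ch, decode_phred33(ch), n)
```

Under that encoding, "#" is Q2, "," is Q11, ":" is Q25 and "F" is Q37. For a real file, open it with open() (or gzip.open(path, "rt") for a .gz file) and pass the handle to quality_counts.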