Question

Weird fastq quality distributions

2

3.2 years ago

evissc ▴ 30

Hi! I'm new to bioinformatics and am working with some FASTQ files that have strange base quality distributions (see image). We see only 4 unique Phred scores across the whole file, which seems surprising given that Illumina sequencing has 41 possible scores. This is happening across multiple files, and these files are straight from the sequencing company. The four values correspond to the quality characters "F", ",", ":" and "#".
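For anyone who wants to see what those four characters mean numerically, here is a minimal sketch. It assumes the standard Phred+33 encoding (Sanger / Illumina 1.8+), which the post does not state explicitly, and the function names are purely illustrative.

```python
def phred33_to_q(char: str) -> int:
    """Convert one FASTQ quality character to a Phred score.

    Assumes Phred+33 encoding: Q = ASCII code - 33.
    """
    return ord(char) - 33


def error_probability(q: int) -> float:
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# The four quality characters reported in the question.
for ch in ["F", ":", ",", "#"]:
    q = phred33_to_q(ch)
    print(f"{ch!r}: Q{q}, error probability ~{error_probability(q):.4f}")
```

Under that assumption the four characters decode to Q37, Q25, Q11 and Q2.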

I have confirmed this behaviour with multiple people in my team, so this is not an analysis problem; it is an issue with the files themselves (it is also obvious when looking at the raw Phred scores of individual reads).

I also found this other question on Biostars. It's hard to tell, but they seem to have the same behaviour, which suggests it may be a common issue. Does anyone have any idea what is happening? The reads themselves seem normal when compared to the reference genome.

We have contacted the sequencing company, but they haven't really provided clarity, so I thought people here might be able to provide some insight.

Thanks in advance!

Aggregation of base qualities over fastq files for same sample

fastq distribution • 1.6k views

ADD COMMENT • link updated 3.2 years ago by GenoMax 152k • written 3.2 years ago by evissc ▴ 30

1

3.2 years ago

ATpoint 88k

It means that the vast majority of bases are of the best quality. That is a good thing, so what is the issue here? Be happy about it.

ADD COMMENT • link 3.2 years ago by ATpoint 88k

0

The issue is that the distribution is surprising: why are there only 4 distinct values, and why isn't there a range of high qualities instead of just this one high value? I am unsure what a standard distribution looks like for WGS, but I was expecting something like the image included (ignore the equal heights).

Taken from the GATK website

ADD REPLY • link 3.2 years ago by evissc ▴ 30

1

What kind of data is this? Is it very recent? Is it Illumina? That sort of thing ...

If I remember correctly, wasn't Illumina going to change its quality scores to a binned approach to reduce file size?

ADD REPLY • link 3.2 years ago by lieven.sterck 15k

1

Ah, thank you @lievensterck! I found this, which seems to be it.

ADD REPLY • link 3.2 years ago by evissc ▴ 30

1

That is probably just how the tool you used to make this graph draws it. Run the files through FastQC if you want the "standard" view, or write a script that collects the metrics into a proper histogram. Without information on the tool I cannot comment. Honestly though, the quality is good. Go ahead with the analysis and don't waste time on base qualities.
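A script along those lines could look roughly like this. It is only a sketch: it assumes plain four-line FASTQ records (no wrapped sequence lines) and Phred+33 encoding, and the function name `quality_histogram` is made up for illustration.

```python
import gzip
from collections import Counter


def quality_histogram(path, offset=33):
    """Count how often each Phred score occurs in a FASTQ file.

    Assumes four lines per record and Phred+33 encoding by default.
    Files ending in .gz are read through gzip.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    counts = Counter()
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 3:  # every fourth line holds the quality string
                counts.update(line.rstrip("\r\n"))
    # Convert quality characters to numeric scores, lowest score first.
    return {ord(ch) - offset: n for ch, n in sorted(counts.items())}
```

Calling `quality_histogram("sample_R1.fastq.gz")` (a hypothetical file name) returns a dict of score to base count. If it only ever contains a handful of distinct scores, the four values are in the file itself rather than being introduced by the plotting tool.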

ADD REPLY • link 3.2 years ago by ATpoint 88k

score 1 · Accepted Answer · 2022-05-20

1

3.2 years ago

evissc ▴ 30

Reposting from lieven.sterck's comment:

"If I remember correctly, wasn't Illumina going to change its quality scores to a binned approach to reduce file size?"

That led me to this answer, which seems to explain it.
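To make the binning idea concrete, here is a toy sketch. The bin edges are invented for illustration and are not Illumina's published thresholds; only the four reported values (2, 11, 25, 37) come from the quality characters in the question, assuming Phred+33.

```python
# Illustrative bins only: (lowest Q, highest Q, reported Q).
# The edges are hypothetical; the reported values match the four
# scores seen in the question's files (assuming Phred+33).
ILLUSTRATIVE_BINS = [
    (0, 2, 2),
    (3, 14, 11),
    (15, 30, 25),
    (31, 93, 37),
]


def bin_quality(q: int) -> int:
    """Map a raw Phred score onto the single value reported for its bin."""
    for low, high, reported in ILLUSTRATIVE_BINS:
        if low <= q <= high:
            return reported
    raise ValueError(f"quality score out of range: {q}")


original = [2, 8, 13, 19, 24, 29, 33, 36, 38, 40]
binned = [bin_quality(q) for q in original]
print(sorted(set(binned)))  # only four distinct values remain
```

Fewer distinct quality symbols compress much better, which is the file-size motivation mentioned in the comment.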

ADD COMMENT • link 3.2 years ago by evissc ▴ 30

1

While this is not going to be needed here, the BBMap suite offers a tool that lets you recalibrate the Q scores based on alignments. The tool is called calctruequality.sh.

ADD REPLY • link 3.2 years ago by GenoMax 152k