Illumina's New Reduced Resolution Q (Phred) Scores - Good Or Bad?
11.4 years ago
Ian 6.1k

With the installation of a new Illumina HiSeq sequencer in our building, it has emerged that there is now an option to output reduced-resolution Q-scores (Phred scores).

The Illumina white paper can be viewed freely by following the link: "Reducing Whole-Genome Data Storage Footprint".

What it basically boils down to is binning the quality scores into 8 groups, each covering a range of the previously individual scores. For example, original scores of 20-24 would be combined into a single bin reported as 22.
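
For illustration, here is a minimal sketch of that binning in Python. The bin boundaries below are the commonly cited 8-bin Illumina mapping, so treat them as an assumption and check the white paper for the authoritative table.

```python
# Minimal sketch of the 8-bin quality reduction described in the white paper.
# Bin boundaries are the commonly cited Illumina mapping (assumption, not
# copied from the white paper itself).

BINS = [
    (2, 9, 6),     # Q2-Q9   -> reported as Q6
    (10, 19, 15),  # Q10-Q19 -> Q15
    (20, 24, 22),  # Q20-Q24 -> Q22 (the example mentioned above)
    (25, 29, 27),  # Q25-Q29 -> Q27
    (30, 34, 33),  # Q30-Q34 -> Q33
    (35, 39, 37),  # Q35-Q39 -> Q37
    (40, 41, 40),  # Q40+    -> Q40
]

def bin_quality(q):
    """Map a single Phred score to its reduced-resolution bin value."""
    if q < 2:          # Q0/Q1 ("no call") pass through unchanged
        return q
    for low, high, binned in BINS:
        if low <= q <= high:
            return binned
    return 40          # anything above the last bin collapses to Q40

# Example: the original scores 20-24 all become 22
print([bin_quality(q) for q in range(20, 25)])  # [22, 22, 22, 22, 22]
```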

The white paper shows that there will be gains from smaller compressed file sizes, without a loss of accuracy in analyses such as variant detection.
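
As a rough, toy illustration of where those gains come from (synthetic random data, only meant to show the entropy effect, not real file-size numbers): a quality string drawn from ~40 distinct characters compresses noticeably worse than one drawn from only 8.

```python
import gzip
import random

# Toy demonstration: fewer distinct quality characters means lower entropy,
# so gzip compresses the binned string better. Scores are random, purely
# illustrative; real quality distributions differ.

random.seed(0)
n = 1_000_000
full_alphabet   = [chr(33 + q) for q in range(2, 42)]                     # ~40 possible scores
binned_alphabet = [chr(33 + q) for q in (2, 6, 15, 22, 27, 33, 37, 40)]   # 8 bins

full   = "".join(random.choice(full_alphabet) for _ in range(n))
binned = "".join(random.choice(binned_alphabet) for _ in range(n))

print("full-resolution:", len(gzip.compress(full.encode())), "bytes compressed")
print("8-bin reduced:  ", len(gzip.compress(binned.encode())), "bytes compressed")
```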

I would be interested to know whether people think this is a reasonable step to take in order to reduce ever increasing file sizes, or whether the loss of accuracy is a problem.

Thanks!

illumina

I've always wondered how NGS machines calculate quality scores. As far as I know, the original Phred scores were derived by looking at various metrics of the chromatogram peaks in Sanger-sequenced reads; using a quality lookup table for each metric, quality scores were calculated and assigned. How does that translate to NGS machines? Does the software look at the signal peaks and use the same lookup tables? Wouldn't Sanger peaks be completely different from Illumina/454/SOLiD signal peaks due to the sequencing chemistry?


For NGS sequencers, "Phred-like" quality scores are used: the machine calculates its own quality scores, which are not based on the original Phred assumptions.
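
For what it's worth, the Phred definition itself, Q = -10 * log10(P_error), and the FASTQ ASCII encoding are the same regardless of platform; only the way each base caller estimates P_error differs. A small sketch:

```python
import math

# Phred definition: Q = -10 * log10(P_error). Platforms differ only in how
# they estimate P_error (chromatogram metrics for Sanger, signal/intensity
# metrics and calibration tables for Illumina-style base callers).

def phred_from_error(p_error):
    return -10 * math.log10(p_error)

def error_from_phred(q):
    return 10 ** (-q / 10)

# In FASTQ files the score is stored as a single ASCII character, offset by 33.
def encode_fastq(q):
    return chr(int(round(q)) + 33)

print(phred_from_error(0.001))   # 30.0 -> 1 expected error in 1000 calls
print(error_from_phred(20))      # 0.01
print(encode_fastq(40))          # 'I'
```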

11.4 years ago

I investigated this for variant calling pipelines and found only a small impact on alignment and minimal impact on variant calling:

http://bcbio.wordpress.com/2013/02/13/the-influence-of-reduced-resolution-quality-scores-on-alignment-and-variant-calling/

For variant calling, the main influence is on calls in low-coverage (< 10x) regions. Calling in these regions is already difficult, so this only adds to the general belief that you should carefully verify calls with low coverage support.
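
For instance, a minimal sketch of that kind of verification step, assuming a plain VCF with a DP depth field in the INFO column (the field name and the 10x threshold are purely illustrative):

```python
# Flag variant calls whose read depth falls below a threshold so they can be
# verified manually. Assumes DP=<int> is present in the INFO column.

MIN_DEPTH = 10

def depth_from_info(info):
    """Extract DP=<int> from a VCF INFO string, or None if absent."""
    for field in info.split(";"):
        if field.startswith("DP="):
            return int(field.split("=", 1)[1])
    return None

def low_coverage_calls(vcf_path):
    with open(vcf_path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            cols = line.rstrip("\n").split("\t")
            depth = depth_from_info(cols[7])
            if depth is not None and depth < MIN_DEPTH:
                yield cols[0], cols[1], depth  # chromosome, position, depth

# Usage: for chrom, pos, dp in low_coverage_calls("calls.vcf"): print(chrom, pos, dp)
```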

Brendan O'Fallon also looked into this and came to similar conclusions:

http://basecallbio.wordpress.com/2013/04/23/base-quality-score-rebinning/

11.4 years ago

I think no accuracy is lost; the measure was never that accurate to begin with. It is the precision that is reduced, and for good reason: the granularity of the quality score is far finer than the error in determining the value.
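
A quick back-of-the-envelope check of that point: within the Q20-Q24 bin the implied error probabilities only span about 1% down to 0.4%, a spread easily dwarfed by the uncertainty of the estimate itself.

```python
# Error probabilities implied by the Phred scale across one bin (Q20-Q24).

def error_rate(q):
    return 10 ** (-q / 10)

for q in (20, 22, 24):
    print(f"Q{q}: {error_rate(q):.4f} expected errors per base")
# Q20: 0.0100
# Q22: 0.0063
# Q24: 0.0040
```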

IMHO the "quality measures" are no more than "guesstimates". As far as I am concerned there are only two quality scores, good and bad, with a "no-man's-land" in between. Sequences that show deteriorating quality scores indicate a systematic error in the process and are unrelated to the actual miscalling of bases.
