With the installation of a new Illumina HiSeq sequencer in our building, it has emerged that there is now an option to output reduced-resolution Q-scores (Phred scores).
The Illumina white paper can be viewed freely by following the link: "Reducing Whole-Genome Data Storage Footprint".
What it basically boils down to is replacing the individual quality scores with 8 quality score groups, each covering a range of the original scores. In their example, the original scores 20-24 are combined into a single bin with a value of 22.
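To make the binning concrete, here is a minimal Python sketch. Only the 20-24 -> 22 row comes from the example quoted above; the other bin edges and values are my assumptions for illustration and should be checked against the white paper. The function names and the Phred+33 ASCII offset are also assumptions, not anything Illumina ships.

```python
# Hypothetical bin table: (lowest score, highest score, binned value).
# Only the (20, 24, 22) row is taken from the post; the rest are assumed.
BINS = [
    (2, 9, 6),
    (10, 19, 15),
    (20, 24, 22),
    (25, 29, 27),
    (30, 34, 33),
    (35, 39, 37),
    (40, 93, 40),
]


def bin_quality(q):
    """Map one Phred score to its bin value; scores below 2 pass through."""
    for low, high, value in BINS:
        if low <= q <= high:
            return value
    return q  # assumed "no call" group: left unchanged


def bin_quality_string(qual, offset=33):
    """Bin a FASTQ quality string, assuming Phred+33 ASCII encoding."""
    return "".join(chr(bin_quality(ord(c) - offset) + offset) for c in qual)
```

For example, bin_quality_string("5?I") decodes to the scores 20, 30 and 40 and re-encodes them as 22, 33 and 40. The file size gain comes from the quality line using far fewer distinct symbols, which compresses better.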
The white paper shows that there will be gains due to reduced compressed file sizes, without a loss of accuracy in analyses such as variant detection.
I would be interested to know whether people think this is a reasonable step to take in order to reduce ever-increasing file sizes, or whether the loss of resolution in the quality scores is a problem.
Thanks!
I've always wondered how NGS machines calculate quality scores. As far as I know, the original Phred scores were derived by looking at various metrics of the chromatogram peaks in Sanger-sequenced reads. Quality scores were then calculated and assigned using a quality lookup table for each metric. How does that translate to NGS machines? Does the software look at the signal peaks and use the same lookup tables? Wouldn't Sanger peaks be completely different from Illumina/454/SOLiD signal peaks due to the sequencing chemistry?
NGS sequencers use "Phred-like" quality scores. The machine calculates its own quality scores, which are not based on the assumptions of the original Phred method.
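What "Phred-like" scores keep is the scale: a score Q corresponds to an estimated error probability P through Q = -10 * log10(P), however the machine arrives at P. A small sketch of that conversion, with function names that are only illustrative:

```python
import math


def phred_to_error_prob(q):
    """Estimated probability that a base call with score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred-scaled score for an estimated error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)
```

So Q20 means roughly a 1 in 100 chance of a wrong call, and Q30 roughly 1 in 1000.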