Question

Question About The Empirical Quality Score Mentioned In The GATK Recalibration Step

1

12.4 years ago

Liye Zhang ▴ 80

Hi,

I tried to find more details about the empirical quality score (more specifically, figure 3 in the paper: http://www.nature.com/ng/journal/v43/n5/abs/ng.806.html) used in the comparison provided in the GATK paper on quality score recalibration, but I could not find any. Could someone give me some idea of how empirical quality scores are calculated and obtained? As I understand it, there is sequencing bias related to base composition and read length, so the recalibrated quality scores should be better. Still, it would be great if someone could explain or elaborate on the concept of the empirical quality score a little more.

Thanks.

gatk quality scoring • 4.1k views

ADD COMMENT • link updated 12.4 years ago by Jorge Amigo 14k • written 12.4 years ago by Liye Zhang ▴ 80

0

Have you looked at the methods section of the paper that you are referring to? It seems to describe the mathematical background of the base quality recalibration. I'm sorry I can't give you a better answer than that - my own understanding of the procedure doesn't extend further than that.

ADD REPLY • link 12.4 years ago by Johan ▴ 890

score 2 · Answer 1 · 2012-07-09

You should consult the online methods section here: http://www.nature.com/ng/journal/v43/n5/extref/ng.806-S1.pdf

There you will find the "Base miscalling confusion matrices" section, which describes how they found the empirical error rates that are specific to each platform and each type of miscall.

The way I understand it, different miscalls occur at different rates even when the reported quality is the same.
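To make "empirical" concrete, here is a rough sketch of the usual Phred-style calculation rather than the paper's exact formula. The symbols are my own: n is the number of bases in a group (say, all bases reported as Q30) that do not fall on known variant sites, and m is how many of them disagree with the reference. The +1 pseudocount is an assumption to avoid taking the log of zero; the exact smoothing may differ between GATK versions.

```latex
% Assumed notation: m = mismatches to the reference, n = bases observed,
% both counted within one group and excluding known variant sites.
Q_{\text{emp}} = -10 \log_{10}\!\left(\frac{m + 1}{n + 1}\right)

% Worked example: n = 1{,}000{,}000 bases reported as Q30, m = 9{,}999 mismatches
Q_{\text{emp}} = -10 \log_{10}\!\left(\frac{10{,}000}{1{,}000{,}001}\right) \approx 20
```

In that example the machine claims Q30 (1 error in 1,000) while the data support only about Q20 (1 error in 100), which is the kind of gap the reported-versus-empirical comparison is meant to expose.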

score 0 · Answer 2 · 2012-07-09

GATK's base quality score recalibration step is broadly described on its wiki page. The way I understand this step is, roughly speaking, that it is like removing the background noise from an analog signal. Once known polymorphic sites have been excluded from the process (dbSNP sites are the suggested set), the idea is to normalize all the base qualities by lowering the "noise" they may share. This way you will be able to find low signals (low-quality bases) that would otherwise be "lost in translation".
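The description above can be sketched in a few lines of Python. This is only an illustration of the idea, not GATK's implementation: the function name `empirical_qualities`, the tuple layout of `observations`, and the `pseudocount` of 1 are all assumptions of mine, and real recalibration also groups bases by covariates such as machine cycle and sequence context.

```python
import math
from collections import defaultdict


def empirical_qualities(observations, known_sites, pseudocount=1):
    """Estimate an empirical Phred quality for each reported quality value.

    observations: iterable of (position, reported_q, read_base, ref_base)
                  tuples (a hypothetical layout chosen for this sketch).
    known_sites:  set of positions of known polymorphisms (e.g. from dbSNP);
                  mismatches there may be real variants, so they are skipped.
    pseudocount:  assumed smoothing term that avoids log10(0).
    """
    counts = defaultdict(lambda: [0, 0])  # reported_q -> [mismatches, total]
    for position, reported_q, read_base, ref_base in observations:
        if position in known_sites:
            continue  # do not count likely true variants as errors
        counts[reported_q][1] += 1
        if read_base != ref_base:
            counts[reported_q][0] += 1
    return {
        q: -10 * math.log10((mismatches + pseudocount) / (total + pseudocount))
        for q, (mismatches, total) in counts.items()
    }


# Toy data: 99 countable bases reported as Q30, 9 of which mismatch the
# reference, plus one mismatch at a known variant site that is ignored.
obs = [(i, 30, "A", "A") for i in range(90)]
obs += [(100 + i, 30, "C", "A") for i in range(9)]
obs.append((500, 30, "G", "A"))
table = empirical_qualities(obs, known_sites={500})
print(table)
```

With this toy input the bases reported as Q30 come out with an empirical quality of about 10, since (9 + 1) / (99 + 1) = 0.1. Dropping the known-site filter would count the real variant as an error and push the estimate lower still, which is why the known sites are masked first.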