Question

Qual Scores In 1000 Genomes Vcf File

3

13.5 years ago

Simon ▴ 40

Hi,

In the 1000 genomes VCF files, QUAL represents "a phred-scaled quality score for the assertion made in ALT". Does anybody know how they actually calculated this and what factors they consider?

Thanks for any help!

genome • 10k views

updated 4.6 years ago by Biostar • written 13.5 years ago by Simon

score 4 · Answer 1 · 2012-02-10

The majority of our SNP and indel sites are assessed using the Variant Quality Score Recalibrator (VQSR) from the Broad's GATK.

You should find the papers on both GATK and VQSR useful for explaining these things:

http://www.broadinstitute.org/gsa/wiki/index.php/Main_Page
http://www.broadinstitute.org/gsa/wiki/index.php/Variant_quality_score_recalibration
http://genome.cshlp.org/content/20/9/1297.abstract
http://www.nature.com/ng/journal/v43/n5/full/ng.806.html

Ram · Answer 2 · 2014-06-06

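The VCF QUAL score is simply a Phred-scaled quality score, where the probability being scaled is the probability that the assertion made in ALT is wrong. The same scale is used for base qualities: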

Phred Quality score (Q)
Probability that a base is incorrectly called (P)

The formulas you need are:

Q = -10 * log10(P)
P = 10 ** (-Q / 10)    (** means "to the power of")

So, taking the following as an example:

A Phred quality of 30 indicates a 1 in 1000 chance that the base has been called incorrectly.

So Q = 30 and P = 1/1000:

30= -10(Log10(1/1000))

or

1/1000=10**(-30/10)

I hope this helps in some way.
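For anyone who wants to check the arithmetic, here is a minimal Python sketch of the two conversions above. The function names `phred_to_prob` and `prob_to_phred` are made up for this illustration and do not come from any particular library.

```python
import math


def phred_to_prob(q: float) -> float:
    """Convert a Phred-scaled score Q into an error probability P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def prob_to_phred(p: float) -> float:
    """Convert an error probability P (0 < P <= 1) into Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Print the error probability for a few common Phred values.
    for q in (10, 20, 30, 40):
        print(f"Q={q} -> P={phred_to_prob(q):g}")
```

The same conversion applies to a QUAL value read from a VCF record, since QUAL is on the same scale.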

score 2 · Answer 3 · 2012-02-22

Hi,

I found this in the GATK paper.

"In brief, our example genotyper computes the posterior probability of each genotype, given the pileup of sequencer reads that cover the current locus, and expected heterozygosity of the sample. This computation is used to derive the prior probability each of the possible 10 diploid genotypes, using the Bayesian formulation (Shoemaker et al. 1999)

[Formula here]

where D represents our data (the read base pileup at this reference base) and G represents the given genotype. The term p(G) is the prior probability of seeing this genotype, which is influenced by its identity as a homozygous reference, heterozygous, or homozygous nonreference genotype. The value p(D) is constant over all genotypes, and can be ignored, and

[another formula here]

where b represents each base covering the target locus. The probability of each base given the genotype is defined as [one more formula here], when the genotype G = {A1,A2} is decomposed into its two alleles. The probability of seeing a base given an allele is [formula missing here]

and the epsilon term e is the reversed phred scaled quality score at the base. Finally, the assigned genotype at each site is the genotype with the greatest posterior probability, which is emitted to disk if its log-odds score exceeds a set threshold."

So, as I understand it, they take the depth and the base qualities into account in this estimate.

M
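To make the quoted description more concrete, here is a rough Python sketch of a genotyper of that shape. It is not GATK's code. It assumes the posterior is proportional to p(G) times the product over pileup bases of p(b|G), with p(b|G) = 0.5 * p(b|A1) + 0.5 * p(b|A2). The per-allele term (1 - e for a matching base, e/3 otherwise) is an assumption, because that formula is missing from the quote. The prior values in `genotype_prior` are also invented for illustration, as is the "QUAL-like" number, which is simply the Phred-scaled posterior probability of the homozygous reference genotype.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"


def phred_to_prob(q):
    """Error probability e for a Phred-scaled base quality q."""
    return 10 ** (-q / 10)


def p_base_given_allele(b, allele, e):
    # ASSUMPTION: a miscalled base is equally likely to be any of the 3 other bases.
    return 1 - e if b == allele else e / 3


def genotype_prior(a1, a2, ref, het=0.001):
    # ASSUMPTION: illustrative priors only, keyed on the three classes in the quote.
    if a1 == a2 == ref:
        return 1 - 1.5 * het   # homozygous reference
    if a1 != a2:
        return het             # heterozygous
    return het / 2             # homozygous non-reference


def call_genotype(pileup, ref, het=0.001):
    """pileup is a list of (base, phred_quality) pairs covering one locus."""
    log_post = {}
    for a1, a2 in combinations_with_replacement(BASES, 2):  # the 10 diploid genotypes
        log_lik = 0.0
        for b, q in pileup:
            e = phred_to_prob(q)
            log_lik += math.log(0.5 * p_base_given_allele(b, a1, e)
                                + 0.5 * p_base_given_allele(b, a2, e))
        log_post[(a1, a2)] = math.log(genotype_prior(a1, a2, ref, het)) + log_lik

    # Normalise in log space; this plays the role of dividing by p(D).
    top = max(log_post.values())
    log_norm = top + math.log(sum(math.exp(v - top) for v in log_post.values()))
    posteriors = {g: math.exp(v - log_norm) for g, v in log_post.items()}
    best = max(posteriors, key=posteriors.get)

    # ASSUMPTION: a QUAL-like score, the Phred-scaled posterior of "no variant here".
    qual_like = max(0.0, -10 * (log_post[(ref, ref)] - log_norm) / math.log(10))
    return best, posteriors, qual_like


if __name__ == "__main__":
    pileup = [("A", 30)] * 5 + [("G", 30)] * 5
    best, posteriors, qual_like = call_genotype(pileup, ref="A")
    print(best, round(qual_like, 1))
```

Both depth (the number of pileup entries) and base quality (through e) enter the likelihood here, which fits the reading above that these two factors drive the score.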

score 2 · Answer 4 · 2012-02-22

Since the 1000 Genomes calls are GATK-based, aside from the readings that Laura suggests, I would highly recommend digging into GATK's site and extracting the relevant information from it:

The Unified Genotyper and its quality score calculation are described on the variant calling algorithm page, which should directly answer your question about the score and its formula.
Also, it's very useful to know that GATK can use a set of known variant sites to perform base quality score recalibration, which ultimately helps the algorithm described above.
Finally, there are useful recommendations for variant detection, which include things like marking/removing duplicate reads, realigning around indels, and the recalibration mentioned above.