Hi,
In the 1000 genomes VCF files, QUAL represents "a phred-scaled quality score for the assertion made in ALT". Does anybody know how they actually calculated this and what factors they consider?
Thanks for any help!
The majority of our SNP and indel sites are assessed using the Variant Quality Score Recalibrator (VQSR) from the Broad's GATK.
You should find the GATK and VQSR papers useful for explaining these things:
http://www.broadinstitute.org/gsa/wiki/index.php/Main_Page
http://www.broadinstitute.org/gsa/wiki/index.php/Variant_quality_score_recalibration
http://genome.cshlp.org/content/20/9/1297.abstract
http://www.nature.com/ng/journal/v43/n5/full/ng.806.html
The VCF QUAL score is simply a Phred-scaled quality score.
The formulas you need are:
Q = -10 * log10(P)
P = 10**(-Q/10)   (** means "to the power of")
So if you take the following as an example:
A Phred quality of 30 indicates a 1 in 1000 chance that the base has been called incorrectly.
So Q = 30 and P = 1/1000:
30 = -10 * log10(1/1000)
or
1/1000 = 10**(-30/10)
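If it helps, here is a minimal Python sketch of the two conversions. The function names are my own, not from any library. Note that for the VCF QUAL field, P is the probability that the call in ALT is wrong, rather than a per-base error probability.

```python
import math

def phred_to_prob(q):
    """Convert a Phred-scaled score Q to a probability P = 10**(-Q/10)."""
    return 10 ** (-q / 10)

def prob_to_phred(p):
    """Convert a probability P (0 < P <= 1) to Q = -10 * log10(P)."""
    return -10 * math.log10(p)

# The worked example from above: Q = 30 corresponds to P = 1/1000.
print(phred_to_prob(30))
print(prob_to_phred(1 / 1000))
```

Because of floating-point rounding the printed values may differ from 0.001 and 30 in the last digits.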
I hope this helps in some way.
Hi,
I found this in the GATK paper.
"In brief, our example genotyper computes the posterior probability of each genotype, given the pileup of sequencer reads that cover the current locus, and expected heterozygosity of the sample. This computation is used to derive the prior probability each of the possible 10 diploid genotypes, using the Bayesian formulation (Shoemaker et al. 1999)
[Formula here]
where D represents our data (the read base pileup at this reference base) and G represents the given genotype. The term p(G) is the prior probability of seeing this genotype, which is influenced by its identity as a homozygous reference, heterozygous, or homozygous nonreference genotype. The value p(D) is constant over all genotypes, and can be ignored, and
[another formula here]
where b represents each base covering the target locus. The probability of each base given the genotype is defined as [even one more formulas here], when the genotype G = {A1,A2} is decomposed into its two alleles. The probability of seeing a base given an allele is
and the epsilon term e is the reversed phred scaled quality score at the base. Finally, the assigned genotype at each site is the genotype with the greatest posterior probability, which is emitted to disk if its log-odds score exceeds a set threshold."
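For anyone without the paper to hand, the formula placeholders in the quote presumably stand for something like the following. This is my reconstruction from the definitions in the quoted text (Bayes' rule, a product over the pileup, an equal mixture of the two alleles, and a per-base error term e), so please check it against the paper itself.

```latex
p(G \mid D) = \frac{p(G)\, p(D \mid G)}{p(D)}

p(D \mid G) = \prod_{b \in \text{pileup}} p(b \mid G)

p(b \mid G) = p(b \mid \{A_1, A_2\}) = \tfrac{1}{2}\, p(b \mid A_1) + \tfrac{1}{2}\, p(b \mid A_2)

p(b \mid A) =
\begin{cases}
  1 - e & \text{if } b = A \\
  e / 3 & \text{if } b \neq A
\end{cases}
```

Here e = 10^(-Q/10) for a base with Phred quality Q, which is what "reversed phred scaled quality score" means.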
So, as I understand it, they factor both the depth and the base qualities into this estimate.
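To make that concrete, here is a rough Python sketch of the kind of calculation the quoted passage describes. It is not GATK's code. The function names, the pileup representation and especially the prior values are my own assumptions for illustration, and it assumes every base quality is greater than 0.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"

def base_given_allele(base, allele, err):
    # p(b|A): 1 - e if the read base matches the allele, otherwise e/3
    return 1.0 - err if base == allele else err / 3.0

def log10_likelihood(pileup, genotype):
    # pileup: list of (base, phred_quality) pairs; genotype: pair of alleles
    a1, a2 = genotype
    total = 0.0
    for base, qual in pileup:
        err = 10 ** (-qual / 10)  # Phred quality -> error probability
        total += math.log10(0.5 * base_given_allele(base, a1, err)
                            + 0.5 * base_given_allele(base, a2, err))
    return total

def genotype_prior(genotype, ref, het=0.001):
    # Hypothetical prior keyed on the number of non-reference alleles.
    # The values are placeholders, not the ones used by GATK.
    n_alt = sum(allele != ref for allele in genotype)
    return {0: 1.0 - 1.5 * het, 1: het, 2: het / 2}[n_alt]

def genotype_posteriors(pileup, ref, het=0.001):
    # The 10 unordered diploid genotypes over A, C, G, T
    genotypes = list(combinations_with_replacement(BASES, 2))
    log_post = {g: math.log10(genotype_prior(g, ref, het))
                   + log10_likelihood(pileup, g)
                for g in genotypes}
    # p(D) is the same for every genotype, so just normalise at the end
    top = max(log_post.values())
    weights = {g: 10 ** (lp - top) for g, lp in log_post.items()}
    norm = sum(weights.values())
    return {g: w / norm for g, w in weights.items()}

# Example: 6 reads showing A and 5 showing G, all at Q30, reference base A
pileup = [("A", 30)] * 6 + [("G", 30)] * 5
posteriors = genotype_posteriors(pileup, ref="A")
best = max(posteriors, key=posteriors.get)
print(best, posteriors[best])
```

With this made-up pileup the heterozygous A/G genotype comes out on top. More reads (depth) or higher base qualities sharpen the posterior, which is how both enter the estimate.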
M
Since the 1000 Genomes calls are GATK-based, aside from the readings that Laura suggests, I would highly recommend digging into GATK's site and extracting whatever is useful from it: