Hi Everyone
I am starting to look for SNPs in my dataset, and while reading online I came across the term genotype likelihoods. Can you explain to me what it means?
Thanks
V
@Dan Gaston
However, there are situations in VCF output where the most probable genotype differs from the one that is reported. For example, take a VCF file that looks like this:
CHROM POS ID Ref Alt Filter GT:AD:DP:GQ:PL
chr1 845668 . C T [CLIPPED] 0/1:1,3:4:25,92:103,0,26
Let's focus on the GT and PL fields:
0/1, 103,0,26
GT is given as 0/1, so heterozygous; however, the PL field reports the most probable genotype as 1/1 [value 26 = ~25% chance that this is the correct one]. The other genotypes from the PL field: 0/1 [0 probability], 0/0 [value 103, also a very small probability]. The GT reported as 0/1 in this case comes from the AD and DP fields, which show 1,3:4, meaning 4 reads span this position, of which 3 report the ALT allele and 1 reports the REF allele. So one has to be careful when calling GT; the most probable genotype is encoded in the PL field (even though it's not always given in the VCF).
Here is a bit of explanation in a human form:
Definitely read the link that stolarek.ir posted. For a very short and oversimplified answer: most genotype callers use some sort of probabilistic model to determine whether a position matches the reference assembly or has one or more variant alleles. They also typically have different models for SNPs versus indels. Most of these models are Bayesian, so the genotype likelihood, in plain language, is the probability of observing the data (the nucleotides at that position from the aligned reads that pass some filter(s)) if the sample had a specific genotype. The genotype with the best likelihood (highest probability) is typically picked as the observed genotype.
That's not a correct interpretation of the PL field. PL values are phred-scaled likelihood scores, normalized such that the most likely genotype will have a score of 0. So the approximate likelihoods are 10^(-PL/10). In this case, for PL values of 103,0,26 the likelihoods would be
10^(-10.3) approximately 5.0E-11
10^0 approximately 1
10^(-2.6) approximately 0.0025
So the heterozygous case is the most likely and is indicated as such by the PL values.
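To make the arithmetic above easy to repeat for other records, here is a minimal Python sketch of the 10^(-PL/10) conversion. The function name `pl_to_likelihoods` and the `GENOTYPES` list are just illustrative names, and the genotype order 0/0, 0/1, 1/1 assumes a biallelic diploid site like the one in this example.

```python
# Genotype order of the PL values at a biallelic diploid site (assumption
# matching the example record in this thread).
GENOTYPES = ["0/0", "0/1", "1/1"]


def pl_to_likelihoods(pl_values):
    """Convert phred-scaled PL values to approximate relative likelihoods."""
    return [10 ** (-pl / 10) for pl in pl_values]


pl = [103, 0, 26]  # PL field from the record discussed above
likelihoods = pl_to_likelihoods(pl)

for genotype, score, likelihood in zip(GENOTYPES, pl, likelihoods):
    print(f"{genotype}\tPL={score}\tlikelihood ~ {likelihood:.2g}")

# The most likely genotype is the one with the smallest PL (normalized to 0).
best = GENOTYPES[pl.index(min(pl))]
print("most likely genotype:", best)
```

The values are relative to the best genotype, so they do not need to sum to 1.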
Hmm, thanks for this. Just yesterday I was reading a GATK page that had a mistake on it (it really confused me).
Worth noting, though, that the most LIKELY genotype is not always the called one; the call is made after the PROBABILITY is calculated. Note for instance the first line here:
GT:PL:DP:DV:GP:GQ 0/1:26,3,0:1:1:23,1,5:5
GT:PL:DP:DV:GP:GQ 0/1:27,0,35:4:2:23,0,44:23
GT:PL:DP:DV:GP:GQ 0/1:27,3,0:1:1:27,2,3:3
GT:PL:DP:DV:GP:GQ 0/1:28,3,0:1:1:26,1,4:4
The PL field (26,3,0) suggests 1/1 as the most likely genotype. The GP field (23,1,5) shows 0/1 is the most probable: so 0/1 is called. Such cases seem to almost always occur with really low read depths and genotype qualities.
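If you want to spot these cases in your own data, a rough Python sketch like the one below compares the best genotype by PL with the best genotype by GP for the four records above. It assumes GP is phred-scaled here (lower is more probable), as the reading above implies; some tools write GP as plain probabilities instead, in which case you would take the maximum. The helper names `parse_sample` and `best_genotype` are made up for this example.

```python
GENOTYPES = ["0/0", "0/1", "1/1"]  # biallelic diploid order (assumption)


def parse_sample(format_str, sample_str):
    """Pair FORMAT keys with the sample's colon-separated values."""
    return dict(zip(format_str.split(":"), sample_str.split(":")))


def best_genotype(phred_field):
    """Return the genotype with the lowest phred-scaled score."""
    scores = [int(x) for x in phred_field.split(",")]
    return GENOTYPES[scores.index(min(scores))]


FORMAT = "GT:PL:DP:DV:GP:GQ"
samples = [
    "0/1:26,3,0:1:1:23,1,5:5",
    "0/1:27,0,35:4:2:23,0,44:23",
    "0/1:27,3,0:1:1:27,2,3:3",
    "0/1:28,3,0:1:1:26,1,4:4",
]

results = []
for sample in samples:
    fields = parse_sample(FORMAT, sample)
    by_pl = best_genotype(fields["PL"])  # most likely genotype
    by_gp = best_genotype(fields["GP"])  # most probable genotype
    results.append((fields["GT"], by_pl, by_gp, fields["DP"]))
    flag = "" if by_pl == by_gp else "  <- PL and GP disagree"
    print(f"GT={fields['GT']} PL->{by_pl} GP->{by_gp} DP={fields['DP']}{flag}")
```

In these four records the disagreement shows up only in the three with DP=1, which fits the observation about low read depth.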
So do you know why this is happening?