Genotype Likelihoods
11.4 years ago
Varun Gupta ★ 1.3k

Hi Everyone

I am starting to look for SNPs in my dataset, and while reading online I came across the term "genotype likelihoods". Can you explain what it means?

Thanks

V

11.4 years ago
stolarek.ir ▴ 700

@Dan Gaston

However, there are situations in VCF output where the most probable genotype differs from the one that is reported. For example, take a VCF file that looks like this:

CHROM POS ID Ref Alt Filter GT:AD:DP:GQ:PL

chr1 845668 . C T [CLIPPED] 0/1:1,3:4:25,92:103,0,26

Let's focus on the GT and PL fields:

0/1, 103,0,26

GT is given as 0/1, i.e. heterozygous. However, the PL field reports the most probable genotype as 1/1 [value 26 = ~25% chance that this is the correct one]. The other genotypes from the PL field: 0/1 [0 probability], 0/0 [value 103, also a very small probability]. In this case the GT reported as 0/1 comes from the AD and DP fields, which show 1,3:4, meaning 4 reads span this position, of which 3 support the ALT allele and 1 supports the REF allele. So one has to be careful when calling GT: the most probable genotype is encoded in the PL field (even though it is not always given in the VCF).


That's not a correct interpretation of the PL field. PL values are phred-scaled genotype likelihoods, normalized so that the most likely genotype has a score of 0. The approximate likelihoods are therefore 10^(-PL/10). In this case, for PL values of 103,0,26, the likelihoods would be:

10^(-10.3) approximately 5.0E-11

10^0 approximately 1

10^(-2.6) approximately 0.0025

So the heterozygous case is the most likely and is indicated as such by the PL values.
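For anyone who wants to check the arithmetic, here is a minimal Python sketch of the conversion. The function name `pl_to_likelihoods` is just something I made up for illustration, and the genotype order 0/0, 0/1, 1/1 assumes a biallelic diploid site.

```python
def pl_to_likelihoods(pl_values):
    """Convert phred-scaled PL values to relative likelihoods: 10^(-PL/10)."""
    return [10 ** (-pl / 10) for pl in pl_values]


# Genotype order used by PL at a biallelic diploid site.
genotypes = ["0/0", "0/1", "1/1"]
pl = [103, 0, 26]  # PL values from the record discussed above

likelihoods = pl_to_likelihoods(pl)
for gt, score, lk in zip(genotypes, pl, likelihoods):
    print(f"{gt}: PL={score:>3}  relative likelihood ~ {lk:.2g}")

# The most likely genotype is the one with the largest likelihood,
# which is the same as the one with PL = 0.
best = genotypes[likelihoods.index(max(likelihoods))]
print("most likely genotype:", best)
```

Note these are relative likelihoods (scaled so the best genotype gets 1), not probabilities that sum to 1.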


Hmm, thanks for this. Just yesterday I was reading a GATK page that had a mistake on it (it really confused me).


Worth noting, though, that the most LIKELY genotype is not always the called one; the call is made after the PROBABILITY is calculated. Note for instance the first line here:

GT:PL:DP:DV:GP:GQ 0/1:26,3,0:1:1:23,1,5:5

GT:PL:DP:DV:GP:GQ 0/1:27,0,35:4:2:23,0,44:23

GT:PL:DP:DV:GP:GQ 0/1:27,3,0:1:1:27,2,3:3

GT:PL:DP:DV:GP:GQ 0/1:28,3,0:1:1:26,1,4:4

The PL field (26,3,0) suggests 1/1 as the most likely genotype. The GP field (23,1,5) shows 0/1 is the most probable: so 0/1 is called. Such cases seem to almost always occur with really low read depths and genotype qualities.
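A quick way to see this on the four lines above is to pick the genotype with the lowest score in each field. This is only a sketch: the helper names are hypothetical, and it assumes GP is phred-scaled in this file (lower = more probable), which is how I am reading it here; other VCF versions and tools may store GP as plain probabilities.

```python
GENOTYPES = ["0/0", "0/1", "1/1"]  # order at a biallelic diploid site
FORMAT = "GT:PL:DP:DV:GP:GQ"

# The four sample columns quoted above.
SAMPLES = [
    "0/1:26,3,0:1:1:23,1,5:5",
    "0/1:27,0,35:4:2:23,0,44:23",
    "0/1:27,3,0:1:1:27,2,3:3",
    "0/1:28,3,0:1:1:26,1,4:4",
]


def parse_sample(format_str, sample_str):
    """Pair each FORMAT key with the matching value of the sample column."""
    return dict(zip(format_str.split(":"), sample_str.split(":")))


def best_by_phred(field):
    """Return the genotype with the smallest phred-scaled score."""
    scores = [float(x) for x in field.split(",")]
    return GENOTYPES[scores.index(min(scores))]


results = []
for sample in SAMPLES:
    fields = parse_sample(FORMAT, sample)
    by_pl = best_by_phred(fields["PL"])  # most likely (likelihood only)
    by_gp = best_by_phred(fields["GP"])  # most probable (assumed phred-scaled)
    results.append((fields["GT"], by_pl, by_gp, fields["DP"]))
    print(f"GT={fields['GT']}  best by PL={by_pl}  best by GP={by_gp}  DP={fields['DP']}")
```

In the three records where PL and GP disagree, the read depth is 1, which fits the low-depth pattern described above.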


So do you know why this is happening?

11.4 years ago
DG 7.3k

Definitely read the link that stolarek.ir posted. As a very short and oversimplified answer: most genotype callers use some sort of probabilistic model to determine whether a position matches the reference assembly or carries one or more variant alleles. They also typically have different models for SNPs versus indels. Most of these models are Bayesian, and therefore the genotype likelihood, in plain language, is the probability of a specific genotype given the data (the nucleotides at that position from the aligned reads that pass some filter(s)). The genotype with the best likelihood (highest probability) is picked as the observed genotype.


This is incorrect. The genotype likelihood is not "the probability of a specific genotype given the data" but the other way around: it is the probability of the data given a specific genotype.
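To make the distinction concrete, here is a small sketch of how a likelihood P(data | genotype) turns into a posterior P(genotype | data) once a prior is applied via Bayes' rule. The prior values are made up purely for illustration and are not what any particular caller uses; the PL values are taken from one of the low-depth records quoted earlier in the thread.

```python
GENOTYPES = ["0/0", "0/1", "1/1"]

# Phred-scaled likelihoods from a low-depth record in this thread.
pl = [26, 3, 0]

# Hypothetical genotype prior, invented for this example only.
prior = [0.90, 0.08, 0.02]

# P(data | genotype), up to a constant factor shared by all genotypes.
likelihood = [10 ** (-x / 10) for x in pl]

# Bayes' rule: P(genotype | data) is proportional to likelihood * prior.
unnormalized = [lk * pr for lk, pr in zip(likelihood, prior)]
total = sum(unnormalized)
posterior = [u / total for u in unnormalized]

most_likely = GENOTYPES[likelihood.index(max(likelihood))]
most_probable = GENOTYPES[posterior.index(max(posterior))]

for gt, lk, po in zip(GENOTYPES, likelihood, posterior):
    print(f"{gt}: likelihood ~ {lk:.3g}  posterior ~ {po:.3g}")
print("highest likelihood:", most_likely)
print("highest posterior: ", most_probable)
```

With this invented prior, the genotype with the highest likelihood and the genotype with the highest posterior are different, which is the kind of disagreement that can show up when the data are weak.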
