How to interpret genotype likelihoods?
1
2
Entering edit mode
10.5 years ago
philip ▴ 10

I'm browsing the 1000 genomes project data and looking at the VCF files from phase 1 (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/)

Based on genotype likelihoods (GL) I want to calculate a phred-like score for the most likely genotype (per individual). But I'm a bit stuck on the interpretation of the genotype data in these files.

For instance:

REF   ALT   FORMAT     HG00099                           ...  
A       C       GT:DS:GL   0|0:0.000:-0.10,-0.69,-4.10   ...

As I read this, the observed (most likely) genotype is AA (0|0). DS has the appropriate score (dosage of ALT allele = 0.000). But I can't work out how to interpret the GLs. These look log10-scaled (as explained in the documentation) and normalized for the most likely genotype but if that is the case why is the first value -0.10 instead of 0? Can anybody help me out on this?

For my purpose, the GQ value would be great. Why isn't it in these VCFs? I suppose you can get it by summing the likelihoods of the less likely genotypes and make a Phred-like score?

Cheers

genotype-likelihood VCF next-gen • 8.9k views
ADD COMMENT
0
Entering edit mode

Where is the "documentation"? The VCF spec does not require GL to be normalized, and therefore not having 0 is okay.

ADD REPLY
0
Entering edit mode

I was only referring to the documentation regarding the log10 scale. I assumed normalization by analogy with PL (Genotype Likelihoods). Thanks for the clarification

ADD REPLY
1
Entering edit mode
10.5 years ago
Gabriel R. ★ 2.9k

why is the first value -0.10 instead of 0?

Without seeing the coverage and quality scores, it is hard to work back the calculation that was done. But If the likelihood of the sample being homozygous reference 10^-0.10 = 0.79 is greater than the heterozygous case (10^-0.69=0.20). Now, why is it not 0 ? Maybe you have an alternative allele with a low base quality or the bases for the reference are few and have low quality themselves. Again, without seeing the reads per se, I am conjecturing.

For my purpose, the GQ value would be great.

What is your purpose ? The GL should give you all the information you need. When I wish to retain sites that pass a certain quality criterion, I compute the difference between the log likelihoods (essentially, a likelihood ratio) and apply a threshold to it as to how much certainty I am willing to have.

ADD COMMENT
0
Entering edit mode

Thanks for your reply. Again, my questions were based on the false assumption the GLs are normalised values. So thanks for clarifying this

Indeed, a likelihood ratio will suffice for my purposes. For instance, transforming the GLs to the actual likelihoods and make the ratio: Likelihood ratio = Likelihood(most likely genotype)/(Likelihood(alternative genotype 1) + Likelihood(alternative genotype 2) for a biallelic variant.

Could you give me an idea of a good treshold value for this ratio? For the given example this ratio would be ~3.9 (= 0.79/(0.20 + 7.93E-05) showing this call isn't very reliable. What can be accepted statistically as a good treshold for this ratio?

Thanks!

ADD REPLY
0
Entering edit mode

Start from first principles, what would you be willing to tolerate without loosing too many sites ?

ADD REPLY
0
Entering edit mode

Loosing sites is not an issue, definitely not at first. To start with, I want some idea about threshold value for these likelihood ratios that can be considered as statistically significant (ie a good call).

To give you some context: I'm setting up a 'dummy' genetic database for a clinical research project. I'm using the 1K Genomes project as a proxy for clinically relevant genetic data because we have no 'real' genetic data yet. My work is part of a proof-of-concept validation for a clinical/genetic project where clinical data is compared with genetic data and is coming from different databases (real life situation).

Since a lot of data is still low coverage in the 1000 Genomes Phase 1, I need some threshold (or several) to determine if the genotype calls can be viewed as reliable or not. I suppose in practice, that comes down to a threshold that is generally accepted to yield a good sensitivity/specificity. For instance (and correct me if I'm wrong) base calls with Phred scores higher than 30 are considered to be quite reliable calls.

In that regard, I'm looking for a meaningful threshold value for these likelihood ratio which can be considered as good genotype calls, in analogy to Phred for base calling. As an example, using the same logic for Q30 (1/1000 that the call is wrong) a likelihood ratio of about 1000:1 can be considered a good call. Does this make any sense? :-)

What would be a good way to go about this? I greatly appreciate the input.

Cheers

ADD REPLY
0
Entering edit mode

A way you could do it is by downloading the phase III data from 1000g. To my dismay, there are no GL nor PL, they just have GT fields.

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/

It's a cop-out but it could be a way to go forward with your project

ADD REPLY
0
Entering edit mode

This is a pre-release. I guess the final release will include GL/PL.

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

The genotypes are imputed. Imputation adds a lot of power.

ADD REPLY

Login before adding your answer.

Traffic: 1941 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6