Question

How do genotype likelihoods (PL) work in gVCF files?

0

Entering edit mode

5.5 years ago

popantrop ▴ 50

I ran GATK4 on a single WES BAM file from The Cancer Genome Atlas (TCGA) program.

This is how an example SNP call looks like:

chr22   10685693    .   A   G,<NON_REF> 307.60  .   BaseQRankSum=1.267;DP=52;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-4.582;RAW_MQandDP=91525,52;ReadPosRankSum=0.279 GT:AD:DP:GQ:PL:SB   0/1:34,18,0:52:99:315,0,697,414,751,1165:18,16,1,17

As you can see the PL field features six comma-separated integers: 315,0,697,414,751,1165. My understanding derived from the answer here is that this should correspond to the following (phred-adjusted) genotype likelihoods (N corresponds to NON_REF):

AA: 315
AG: 0
GG: 697
AN: 414
GN: 751
NN: 1165

It appears we have overwhelming evidence of a heterozygous SNP A/G: 34 calls for A, 18 calls for G and 0 calls for anything else. This data makes sense if you follow the formula for computing PL as shown here.

Briefly, probability of genotype given data is going to be very high (almost 1) for AG, and $-10* log(1) = 0$. Conversely it will be very low for CC, CT or TT (NON_REF), in the order of $10^-100$, and $-10 * log(10^-100) ~ 1165$. Given allelic imbalance of A vs G, it also makes sense that AA is more likely than GG.

EDIT

As pointed out in the comments, gVCF file shouldn't be used in downstream analysis, one can combine (multiple) gVCF files(s) into finished genotypes using the GenotypeGVCFs command of gatk. This gets rid of NON_REF SNPs, with the result, which looks like this:

chr22   10685693        .       A       G       307.60  .       AC=1;AF=0.500;AN=2;BaseQRankSum=1.27;DP=52;ExcessHet=3.0103;FS=31.510;MLEAC=1;MLEAF=0.500;MQ=41.95;MQRankSum=-4.582e+00;QD=5.92;ReadPosRankSum=0.279;SOR=4.04  GT:AD:DP:GQ:PL  0/1:34,18:52:99:315,0,697

With the PL field corresponding to:

AA: 315
AG: 0
GG: 697

VCF SNP genotype • 1.7k views

ADD COMMENT • link 5.5 years ago by popantrop ▴ 50

0

Entering edit mode

you're currently looking at a *.g.vcf file, you'd better call GenotypeGVcf before looking at those values...