I ran GATK4 on a single WES BAM file from The Cancer Genome Atlas (TCGA) program.
This is how an example SNP call looks like:
chr22 10685693 . A G,<NON_REF> 307.60 . BaseQRankSum=1.267;DP=52;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-4.582;RAW_MQandDP=91525,52;ReadPosRankSum=0.279 GT:AD:DP:GQ:PL:SB 0/1:34,18,0:52:99:315,0,697,414,751,1165:18,16,1,17
As you can see the PL field features six comma-separated integers: 315,0,697,414,751,1165. My understanding derived from the answer here is that this should correspond to the following (phred-adjusted) genotype likelihoods (N corresponds to NON_REF):
AA: 315
AG: 0
GG: 697
AN: 414
GN: 751
NN: 1165
It appears we have overwhelming evidence of a heterozygous SNP A/G: 34 calls for A, 18 calls for G and 0 calls for anything else. This data makes sense if you follow the formula for computing PL as shown here.
Briefly, probability of genotype given data is going to be very high (almost 1) for AG, and $-10* log(1) = 0$. Conversely it will be very low for CC, CT or TT (NON_REF), in the order of $10^-100$, and $-10 * log(10^-100) ~ 1165$. Given allelic imbalance of A vs G, it also makes sense that AA is more likely than GG.
EDIT
As pointed out in the comments, gVCF file shouldn't be used in downstream analysis, one can combine (multiple) gVCF files(s) into finished genotypes using the GenotypeGVCFs
command of gatk
. This gets rid of NON_REF SNPs, with the result, which looks like this:
chr22 10685693 . A G 307.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=1.27;DP=52;ExcessHet=3.0103;FS=31.510;MLEAC=1;MLEAF=0.500;MQ=41.95;MQRankSum=-4.582e+00;QD=5.92;ReadPosRankSum=0.279;SOR=4.04 GT:AD:DP:GQ:PL 0/1:34,18:52:99:315,0,697
With the PL field corresponding to:
AA: 315
AG: 0
GG: 697
you're currently looking at a *.g.vcf file, you'd better call GenotypeGVcf before looking at those values...
Thanks, that sounds like good advise!