Hi, I'm working with GATK/4.1.2.0 on human whole-genome data.
I'm currently following the procedure to go from a gVCF to a VCF (the gVCF was obtained with HaplotypeCaller using -ERC GVCF).
The order of the tools I'm following is: GenotypeGVCFs -> VariantFiltration -> MakeSitesOnlyVcf -> VariantRecalibrator -> ApplyVQSR
Since I also need to include all the loci found to be non-variant after genotyping, I'm using the "-all-sites true" option in GenotypeGVCFs.
In the VCF I obtain from GenotypeGVCFs, the majority of the 0/0 sites have only DP in the INFO field and lack all the other annotations that VariantRecalibrator will need in a later step (e.g., QD, FS, SOR, MQ, MQRankSum, ReadPosRankSum, and InbreedingCoeff).
Is there any way to get that information for all the sites?
And if not, will DP alone be enough for VariantRecalibrator to work on them?
For example, if I have these two sites in the VCF after GenotypeGVCFs:
chr1 10436 . C . 87.81 . DP=55 GT:AD:DP:RGQ 0/0:55,0:55:51
chr1 13868 . A G 122.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=-2.950e-01;ClippingRankSum=-7.660e-01;DP=15;ExcessHet=3.0103;FS=15.564;MLEAC=1;MLEAF=0.500;MQ=32.73;MQRankSum=-2.534e+00;QD=8.17;RAW_MQ=16069.00;ReadPosRankSum=0.412;SOR=3.898 GT:AD:DP:GQ:PL 0/1:9,6:15:99:130,0,248
Will VariantRecalibrator need both sites to carry the same INFO annotations, or will it work properly in any case, even though the first site has only DP and the second one has many other annotations?
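To make the difference between the two records concrete, here is a minimal Python sketch that lists which of the annotations named above are absent from each line. The helper names (info_keys, missing_annotations) are made up for illustration, and the lines are split on any whitespace because they are pasted here with spaces; a real VCF is tab-separated.

```python
def info_keys(vcf_line):
    """Return the set of INFO keys found on one VCF data line."""
    fields = vcf_line.split()  # whitespace split: works for tabs and pasted spaces
    info = fields[7]           # INFO is the 8th column
    if info == ".":
        return set()
    return {entry.split("=", 1)[0] for entry in info.split(";")}


# Annotations listed in the question as inputs for VariantRecalibrator
WANTED = {"QD", "FS", "SOR", "MQ", "MQRankSum", "ReadPosRankSum", "InbreedingCoeff"}


def missing_annotations(vcf_line):
    """Return the wanted annotations that are absent from the line, sorted."""
    return sorted(WANTED - info_keys(vcf_line))


# The two records from the question
ref_site = "chr1 10436 . C . 87.81 . DP=55 GT:AD:DP:RGQ 0/0:55,0:55:51"
var_site = (
    "chr1 13868 . A G 122.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=-2.950e-01;"
    "ClippingRankSum=-7.660e-01;DP=15;ExcessHet=3.0103;FS=15.564;MLEAC=1;"
    "MLEAF=0.500;MQ=32.73;MQRankSum=-2.534e+00;QD=8.17;RAW_MQ=16069.00;"
    "ReadPosRankSum=0.412;SOR=3.898 GT:AD:DP:GQ:PL 0/1:9,6:15:99:130,0,248"
)

print(missing_annotations(ref_site))
print(missing_annotations(var_site))
```

On these two records, the 0/0 site is missing every annotation in the list, while the 0/1 site is missing only InbreedingCoeff.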
I need the final VCF to include all the sites (0/0, 0/1, and 1/1). So far, everything I've tried always ended with removing all the 0/0 sites eventually.
Could someone please help me with this?
Thank you
No, those tags involve the presence of an ALT allele. For example, MQRankSum:
Thank you Pierre, it makes total sense.
So, should I expect VariantRecalibrator to have problems with 0/0 sites, or will they be kept in the final VCF?
I don't know. Just test it. If 0/0 sites are removed, you can always merge a non-variants.vcf with recalibrated.vcf
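A rough Python sketch of that fallback, assuming tab-separated VCF lines already read into a list: set the non-variant records (ALT is ".") aside, then put them back in coordinate order next to the recalibrated records. The function names and the contig_order argument are hypothetical, and header handling is left out on purpose, since the recalibrated file will carry extra header lines. In practice a dedicated VCF tool would normally do this job; the sketch only shows the logic.

```python
def split_sites(lines):
    """Split VCF lines into header, variant records and non-variant records."""
    header, variant, non_variant = [], [], []
    for line in lines:
        if line.startswith("#"):
            header.append(line)
        elif line.split("\t")[4] == ".":  # ALT is the 5th column; "." means no ALT
            non_variant.append(line)
        else:
            variant.append(line)
    return header, variant, non_variant


def merge_sites(variant, non_variant, contig_order):
    """Merge two record lists, sorted by contig (in the given order) and position."""
    rank = {name: i for i, name in enumerate(contig_order)}

    def sort_key(line):
        chrom, pos = line.split("\t")[:2]
        return rank[chrom], int(pos)

    return sorted(variant + non_variant, key=sort_key)
```

Here contig_order would be the contig names in the order used by the reference, e.g. taken from the ##contig lines of the header.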
Thank you very much, I'm running a test right now. According to the pipeline I'm following, the non-variants.vcf should be the one after the VariantFiltration step; after that step there is a high chance of losing the 0/0 sites. I'm only concerned about the quality of those sites though.
Hi, just an update regarding what you said:
I have cases like this one where an ALT allele is not present, but MQRankSum is reported anyway, along with other statistics:
I think that even if a site is eventually called 0/0, this doesn't mean no ALT reads are present, and that's why MQRankSum can be calculated anyway.
So possibly, when statistics like this one are not reported, only REF reads are present at that site, which is why only DP is reported in the INFO column. On the other hand, when some ALT reads are present, the other INFO statistics can also be calculated.
Of course, this is the explanation I find most logical, but maybe it is too simplistic.
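One way to check this guess on real data would be to tabulate, for every 0/0 record, whether MQRankSum is present against whether the AD field shows any non-REF reads. The sketch below assumes a single-sample, tab-separated VCF whose FORMAT column includes AD; the function names are made up for illustration, and AD may not reflect every read the caller considered, so this is only a rough check.

```python
from collections import Counter


def alt_read_count(fmt, sample):
    """Sum the non-REF depths in AD, or return None if AD is unavailable."""
    keys = fmt.split(":")
    values = sample.split(":")
    if "AD" not in keys:
        return None
    idx = keys.index("AD")
    if idx >= len(values):  # trailing sample fields may be dropped
        return None
    depths = [int(x) for x in values[idx].split(",") if x != "."]
    return sum(depths[1:])  # first AD entry is the REF depth


def tabulate_hom_ref(lines, annotation="MQRankSum"):
    """Count 0/0 records by (annotation present, ALT reads present)."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 10:
            continue  # sites-only record, no sample column
        info, fmt, sample = cols[7], cols[8], cols[9]
        if sample.split(":")[0] not in ("0/0", "0|0"):
            continue  # keep only hom-ref calls (GT comes first in FORMAT)
        alt = alt_read_count(fmt, sample)
        if alt is None:
            continue
        has_annotation = any(
            entry.split("=", 1)[0] == annotation for entry in info.split(";")
        )
        counts[(has_annotation, alt > 0)] += 1
    return counts
```

If the explanation holds, nearly all 0/0 records should fall into the (False, False) and (True, True) cells; many records in the other two cells would suggest something else is going on.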