Hi,
I am generating two distinct SNV/small-indel call sets for each sample from WGS data. The two VCFs are generated by GATK HaplotypeCaller and DeepVariant.
After filtering out low-quality calls from each VCF, I would now like to merge them into a single VCF to be annotated.
This VCF, however, should keep the information about the variant caller that called each variant.
For example, the two caller-specific VCFs might look like this:
HaplotypeCaller VCF
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sampleID
chr1 10583 . G A 405.64 PASS AC=1 GT:AD:DP:GQ:PL 0/1:12,16:28:99:413,0,320
chr1 10622 . T G 205.67 PASS AC=1 GT:AD:DP:GQ:PL 0/1:1,8:9:12:213,0,12
chr1 10623 . T C 189.97 PASS AC=2 GT:AD:DP:GQ:PL 1/1:1,6:7:18:204,18,0
DeepVariant VCF
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sampleID
chr1 10583 . G A 48.8 PASS . GT:GQ:DP:AD:VAF:PL 0/1:50:28:12,16:0.43:413,0,320
chr1 14907 . A G 24.4 PASS . GT:GQ:DP:AD:VAF:PL 0/1:23:68:22,46:0.676471:24,0,27
The merged VCF should have a new INFO field VC (for "Variant Caller") that shows HC if the variant is found in the HaplotypeCaller VCF and DV if it is found in the DeepVariant VCF. When found in both, it should show HC/DV:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sampleID
chr1 10583 . G A 405.64/48.8 PASS VC=HC/DV;AC=1 GT:GQ:DP:AD:VAF:PL 0/1:99:28:12,16:0.43:413,0,320
chr1 10622 . T G 205.67 PASS VC=HC;AC=1 GT:GQ:DP:AD:VAF:PL 0/1:12:9:1,8:0.11:213,0,12
chr1 10623 . T C 189.97 PASS VC=HC;AC=2 GT:GQ:DP:AD:VAF:PL 1/1:18:7:1,6:0.14:204,18,0
chr1 14907 . A G 24.4 PASS VC=DV GT:GQ:DP:AD:VAF:PL 0/1:23:68:22,46:0.68:24,0,27
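To make the intended merge concrete, here is a minimal Python sketch of what I have in mind. The function name `tag_and_merge` and the choice to match variants on (CHROM, POS, REF, ALT) are my own assumptions, not an existing tool. It also assumes both VCFs were normalized the same way, so that identical variants have identical representations. Since QUAL holds a single number in a VCF, the sketch does not write `405.64/48.8`; when both callers report a variant, it keeps the record of the first caller listed and only the VC tag names both.

```python
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, int, str, str]  # (CHROM, POS, REF, ALT)


def tag_and_merge(callsets: Dict[str, Iterable[str]]) -> List[str]:
    """Merge VCF data lines from several callers, adding VC=<labels> to INFO.

    `callsets` maps a caller label (e.g. "HC") to its VCF data lines.
    When several callers report the same variant, the record from the
    first caller listed is kept and the VC tag lists all of them.
    """
    records: Dict[Key, List[str]] = {}
    callers: Dict[Key, List[str]] = {}
    for label, lines in callsets.items():
        for line in lines:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and header lines
            # Split on any whitespace so the space-separated examples above work;
            # real VCF files are tab-separated.
            fields = line.split()
            key = (fields[0], int(fields[1]), fields[3], fields[4])
            records.setdefault(key, fields)  # keep the first record seen
            callers.setdefault(key, []).append(label)

    merged = []
    # Sorts by chromosome name as text, then by position (illustration only).
    for key in sorted(records):
        fields = list(records[key])
        tag = "VC=" + "/".join(callers[key])
        # Replace an empty INFO (".") or prepend the tag to existing INFO.
        fields[7] = tag if fields[7] == "." else tag + ";" + fields[7]
        merged.append("\t".join(fields))
    return merged


hc = [
    "chr1 10583 . G A 405.64 PASS AC=1 GT:AD:DP:GQ:PL 0/1:12,16:28:99:413,0,320",
    "chr1 10622 . T G 205.67 PASS AC=1 GT:AD:DP:GQ:PL 0/1:1,8:9:12:213,0,12",
]
dv = [
    "chr1 10583 . G A 48.8 PASS . GT:GQ:DP:AD:VAF:PL 0/1:50:28:12,16:0.43:413,0,320",
    "chr1 14907 . A G 24.4 PASS . GT:GQ:DP:AD:VAF:PL 0/1:23:68:22,46:0.676471:24,0,27",
]
for record in tag_and_merge({"HC": hc, "DV": dv}):
    print(record)
```

Note that this sketch does not harmonize the FORMAT fields between callers (the sample column is copied unchanged from whichever record is kept), and a real merged VCF would also need a `##INFO=<ID=VC,...>` line in its header.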
Do you think it makes sense to merge the VCFs in this manner, or would you say that it is better to treat them separately and only merge after annotation using a custom Python script?
Thanks for your help and thoughts!