I'm using a VCF that is generated by GenotypeGVCFs (so doing calibration based on a larger cohort of samples) and my goal is to only extract variants of interest to one specific sample.
The VCF in the subset tends to include some variants that were present in the original joint VCF (i.e. variants that are present in a different sample than the one I want to isolate).
I did this with bcftools initially but it seems to not recalculate the genotype AF (essentially it retains the mean AF calculated based the genotypes across samples):
Subset using bcftools:
bcftools view -c1 -s sampleA -o bcftools_test.vcf input.vcf
chr1 6029086 rs1243967206;rs34702999;rs869154835;rs869287558;rs59913966 CAA CA,C,CAAA 3633.93 . AC=1,0,0;AF=0.458,0.083,0.083;AN=2;AS_BaseQRankSum=-0.4,-0.35,-0.5;AS_FS=9.841,4.441,0;AS_InbreedingCoeff=-0.8462,-0.0914,-0.0909;AS_MQ=69.6,68.94,70;AS_MQRankSum=0,0,0;AS_QD=10.63,0.82,1.73;AS_ReadPosRankSum=-0.2,-0.7,0.3;AS_SOR=1.572,1.465,0.605;BaseQRankSum=-0.378;DB;DP=391;ExcessHet=10.6475;FS=9.123;InbreedingCoeff=-0.6;MLEAC=11,2,2;MLEAF=0.458,0.083,0.083;MQ=69.7;MQRankSum=0;QD=12.66;ReadPosRankSum=-0.192;SOR=1.505 GT:AD:DP:GQ:PL 0/1:7,14,3,0:24:82:351,0,82,283,82,561,372,152,511,572
I can also do this with GATK using Select Variants:
gatk SelectVariants --output gatk_test.vcf --sample-name sampleA --variant input.vcf --reference genome.fasta
chr1 6029086 rs1243967206;rs34702999;rs869154835;rs869287558;rs59913966 CAA CA,C,CAAA 3633.93 . AC=1,0,0;AF=0.500,0.00,0.00;AN=2;AS_BaseQRankSum=-0.400,-0.350,-0.500;AS_FS=9.841,4.441,0.000;AS_InbreedingCoeff=-0.8462,-0.0914,-0.0909;AS_MQ=69.60,68.94,70.00;AS_MQRankSum=0.000,0.000,0.000;AS_QD=10.63,0.82,1.73;AS_ReadPosRankSum=-0.200,-0.700,0.300;AS_SOR=1.572,1.465,0.605;BaseQRankSum=-3.780e-01;DB;DP=24;ExcessHet=10.6475;FS=9.123;InbreedingCoeff=-0.6000;MQ=69.70;MQRankSum=0.00;QD=12.66;ReadPosRankSum=-1.920e-01;SOR=1.505 GT:AD:DP:GQ:PL 0/1:7,14,3,0:24:82:351,0,82,283,82,561,372,152,511,57
To be safe and after searching the documentation I wanted to make sure that any variants that weren't specific to this sample were also removed and attempted to use the --remove-unused-alternates
param.
gatk SelectVariants --output gatk_test2.vcf --sample-name sampleA --remove-unused-alternates true --variant input.vcf --reference genome.fasta
However the variant above is now gone from the output. I compared gatk_test.vcf and gatk_test2.vcf and a number of variants were removed. From what I can see all of them are INDELs and include multi-allelic variants where the ALT SNV at the location has a recalculated 0.00 VAF.
Am I missing something about the functionality of --remove-unused-alternates
my expectation would be that it would solely remove the ALT alleles at 0.00 but retain those at >=0.5. So is this expected behavior? Or maybe it's looking at some other quality metrics to make this decision?
For example, for the VCF entry above I would assume it would drop C,CAAA and keep CA as an ALT.
I'm using GATK v4.3 in case this is just a bug that was later fixed but in this situation I can't easily update to the newest version.