I started with a VCF that had this variant:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT [SAMPLE]
12 7045891 . ACAGCAGCAGCAGCAGCAGCAG ACAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG,ACAGCAG,ACAGCAGCAGCAGCAGCAG,ACAGCAGCAGCAGCAGCAGCAGCAG,A,ACAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG 7038.49 PASS AC=0,1,0,0,0,0;AF=0,0.5,0,0,0,0;AN=2;BaseQRankSum=0.113;ClippingRankSum=-0.367;DP=47;FS=2.374;GQ_MEAN=312.8;GQ_STDDEV=257.63;InbreedingCoeff=0.1964;MQ=60.51;MQ0=0;MQRankSum=0.286;NCC=0;POSITIVE_TRAIN_SITE;QD=8.44;ReadPosRankSum=0.946;SOR=0.893;VQSLOD=1.77;culprit=FS GT:AD:DP:GQ:PL 0/2:24,0,23,0,0,0,0:47:99:885,957,3825,0,2868,2797,957,3825,2868,3825,957,3825,2868,3825,3825,957,3825,2868,3825,3825,3825,957,3825,2868,3825,3825,3825,3825
I ran it through a pipeline (unfortunately, not publicly available) that's supposed to normalize variants. It came out as:
12 7045891 . A ACAGCAGCAGCAGCAG 7038.49 PASS AC=0;AF=0;AN=2;BaseQRankSum=0.113;ClippingRankSum=-0.367;DP=47;FS=2.374;GQ_MEAN=312.8;GQ_STDDEV=257.63;InbreedingCoeff=0.1964;MQ=60.51;MQ0=0;MQRankSum=0.286;NCC=0;POSITIVE_TRAIN_SITE;QD=8.44;ReadPosRankSum=0.946;SOR=0.893;VQSLOD=1.77;culprit=FS GT:AD:DP:GQ:PL 0/0:24,0,23,0,0,0,0:47:99:885,957,3825
The genotyope's gone from 0/2
to 0/0
, which I think would mean a change from a deletion to a normal site. Is this variant being recalled? Or am I misinterpreting?
Update: So I looked through the source code, and it looks like the genotype is getting changed by this command:
bcftools norm --multiallelics - --check-ref x --fasta-ref=${fastaRef} ${vcf} --output-type z --output ${o}
which splits the original variant into all of the following
12 7045891 . A ACAGCAGCAGCAGCAG 7038.49 PASS AC=0;AF=0;AN=2;BaseQRankSum=0.113;ClippingRankSum=-0.367;DP=47;FS=2.374;GQ_MEAN=312.8;GQ_STDDEV=257.63;InbreedingCoeff=0.1964;MQ=60.51;MQ0=0;MQRankSum=0.286;NCC=0;POSITIVE_TRAIN_SITE;QD=8.44;ReadPosRankSum=0.946;SOR=0.893;VQSLOD=1.77;culprit=FS GT:AD:DP:GQ:PL 0/0:24,0,23,0,0,0,0:47:99:885,957,3825
12 7045891 . ACAGCAGCAGCAGCAG A 7038.49 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=0.113;ClippingRankSum=-0.367;DP=47;FS=2.374;GQ_MEAN=312.8;GQ_STDDEV=257.63;InbreedingCoeff=0.1964;MQ=60.51;MQ0=0;MQRankSum=0.286;NCC=0;POSITIVE_TRAIN_SITE;QD=8.44;ReadPosRankSum=0.946;SOR=0.893;VQSLOD=1.77;culprit=FS GT:AD:DP:GQ:PL 0/1:24,0,23,0,0,0,0:47:99:885,0,2797
12 7045891 . ACAG A 7038.49 PASS AC=0;AF=0;AN=2;BaseQRankSum=0.113;ClippingRankSum=-0.367;DP=47;FS=2.374;GQ_MEAN=312.8;GQ_STDDEV=257.63;InbreedingCoeff=0.1964;MQ=60.51;MQ0=0;MQRankSum=0.286;NCC=0;POSITIVE_TRAIN_SITE;QD=8.44;ReadPosRankSum=0.946;SOR=0.893;VQSLOD=1.77;culprit=FS GT:AD:DP:GQ:PL 0/0:24,0,23,0,0,0,0:47:99:885,957,3825
12 7045891 . A ACAG 7038.49 PASS AC=0;AF=0;AN=2;BaseQRankSum=0.113;ClippingRankSum=-0.367;DP=47;FS=2.374;GQ_MEAN=312.8;GQ_STDDEV=257.63;InbreedingCoeff=0.1964;MQ=60.51;MQ0=0;MQRankSum=0.286;NCC=0;POSITIVE_TRAIN_SITE;QD=8.44;ReadPosRankSum=0.946;SOR=0.893;VQSLOD=1.77;culprit=FS GT:AD:DP:GQ:PL 0/0:24,0,23,0,0,0,0:47:99:885,957,3825
12 7045891 . ACAGCAGCAGCAGCAGCAGCAG A 7038.49 PASS AC=0;AF=0;AN=2;BaseQRankSum=0.113;ClippingRankSum=-0.367;DP=47;FS=2.374;GQ_MEAN=312.8;GQ_STDDEV=257.63;InbreedingCoeff=0.1964;MQ=60.51;MQ0=0;MQRankSum=0.286;NCC=0;POSITIVE_TRAIN_SITE;QD=8.44;ReadPosRankSum=0.946;SOR=0.893;VQSLOD=1.77;culprit=FS GT:AD:DP:GQ:PL 0/0:24,0,23,0,0,0,0:47:99:885,957,3825
12 7045891 . A ACAGCAGCAGCAGCAGCAGCAG 7038.49 PASS AC=0;AF=0;AN=2;BaseQRankSum=0.113;ClippingRankSum=-0.367;DP=47;FS=2.374;GQ_MEAN=312.8;GQ_STDDEV=257.63;InbreedingCoeff=0.1964;MQ=60.51;MQ0=0;MQRankSum=0.286;NCC=0;POSITIVE_TRAIN_SITE;QD=8.44;ReadPosRankSum=0.946;SOR=0.893;VQSLOD=1.77;culprit=FS GT:AD:DP:GQ:PL 0/0:24,0,23,0,0,0,0:47:99:885,957,3825
(Aha! The second variant in the list, with a genoytpe 0/1
, is what I would expect.) But after this command
bcftools norm -d both ${vcf} --output-type z --output ${o}
the biallelic variants are winnowed down to the single homozygous ref insertion (pasted in the second code block). (The bcftools manual page says "-d: If a record is present multiple times, output only the first instance") Why is the 0/1
variant being thrown out? Could this possibly be purposeful?
Is this in some way standard practice? Assuming the goal of the pipeline is to normalize variants and split multiallelic ones, is this a mistake? Is this a sensical way of recasting one variant as another?
Thank you so much for your response. Does my update shine any more light on the situation?
Read the
bcftools
documentation, and the corresponding output for the givensub commands
. All you have to do is read into the mechanistic details of what the tools does for the given subcommand.I've read the documentation—I now understand why each command has the effect it does. But my question stands: is this in some way standard practice? Assuming the goal of the pipeline is to normalize variants and split multiallelic ones, is this a mistake?