I was analysing sequences for two different samples, and after creating the VCF files we noticed that the same insertion is sometimes expressed differently. This happens even though we used exactly the same process to create both files.
Here are three examples (lines have been truncated for brevity):
10 11805838 . C CT 211.50 . AC=11;AF1=1;AN=12;CI95=1,1;DP4=0,0,19,11;DP=38;FQ=-125;G3=7.906e-63,1e-27,1;INDEL;MQ=49;PV4=1,1,0.21,1;SF=0,1,2,3,4,5
10 11805838 . CG CTG 200.33 . AC=8;AF1=0.5;AN=12;CI95=0.5,0.5;DP4=7,7,10,4;DP=36;FQ=194;G3=7.924e-47,1,5e-50;INDEL;MQ=49;PV4=0.44,1,0.077,1;SF=0,1,2,3,4,5
X 122318386 . A AG 214.00 . AC=12;AF1=1;AN=12;CI95=1,1;DP4=0,0,10,12;DP=24;FQ=-101;G3=3.147e-53,1.585e-20,1;INDEL;MQ=48;SF=0,1,2,3,4,5
X 122318386 . AC AGC 214.00 . AC=12;AF1=1;AN=12;CI95=1,1;DP4=0,0,15,12;DP=29;FQ=-116;G3=3.147e-59,5.012e-25,1;INDEL;MQ=45;SF=0,1,2,3,4,5
11 118889247 . AT AGT 209.00 . AC=12;AF1=1;AN=12;CI95=1,1;DP4=0,0,8,3;DP=16;FQ=-67.5;G3=1.252e-43,6.308e-14,1;INDEL;MQ=36;SF=0,1,2,3,4,5
11 118889247 . A AG 214.00 . AC=12;AF1=1;AN=12;CI95=1,1;DP4=0,0,7,6;DP=14;FQ=-73.5;G3=3.147e-50,2.512e-16,1;INDEL;MQ=47;SF=0,1,2,3,4,5
Unless I'm missing something obvious, each of these three pairs describes a single variant: the same insertion at the same position, expressed in two different ways.
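To make the equivalence concrete, here is a minimal Python sketch that trims the bases shared by REF and ALT and compares the results. The function name `trim_alleles` and the hard-coded records are just for illustration. It only trims shared bases; it does not left-align indels inside repeats, which would need the reference sequence.

```python
def trim_alleles(pos, ref, alt):
    """Trim bases shared by REF and ALT, keeping at least one base in each."""
    # Drop shared trailing bases first (this does not change POS).
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then drop shared leading bases, advancing POS for each base removed.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# (CHROM, POS, REF, ALT) for the three pairs of records quoted above.
pairs = [
    (("10", 11805838, "C", "CT"), ("10", 11805838, "CG", "CTG")),
    (("X", 122318386, "A", "AG"), ("X", 122318386, "AC", "AGC")),
    (("11", 118889247, "AT", "AGT"), ("11", 118889247, "A", "AG")),
]

for first, second in pairs:
    trimmed_first = (first[0],) + trim_alleles(*first[1:])
    trimmed_second = (second[0],) + trim_alleles(*second[1:])
    print(trimmed_first, trimmed_second, trimmed_first == trimmed_second)
```

After trimming, both records in each pair end up with the same POS, REF and ALT, which is what I mean by "equivalent".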
This ambiguity makes it harder to compare results. We used BWA + Samtools + VcfTools to create both files (using exactly the same parameters).
My questions are:
1-) Shouldn't the VCF standard specify that the representation should be minimal to avoid this kind of confusion?
2-) Is there an easy way to fix the VCF files to avoid this?
Thank you.
I do agree with you that it "should" be incorrect. But I've read the specification, and I don't see where it says that this is actually incorrect (maybe I missed it?).
It does say that you have to express the "alternate non-reference alleles", but it doesn't say that this should be done in a minimal way. Again, I agree it should say that; otherwise you could just write the whole chromosome starting from that position and you would still be complying with the specification (which is ridiculous).
The other problem is that the same software creates both forms.