I am looking for a tool/script/pony to correct the REF column in a vcf file whenever that nucleotide doesn't match the reference genome, as supplied in fasta format.
It sounds like a common task but I could not find something. I did find bcftools +fixref but that only works for SNPs. My vcf files are from structural variants.
what does it currently look like in the REF/ALT column ? (why do you need to fill the REF column (thinking of a very large deletion...) in your downstream analysis ? )
My main issue is that EVA complains that the nucleotides are incorrect. It seems their software only uses POS to check the REF nucleotide, and not END.
Just some lines from the vcf:
chr1 258046 93 G <INS> 161 MapQual IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=258047;CIPOS=-10,1;CIEND=-10,1;SVLEN=126;RT=0,8,0;GAP=126;MAPQ=60,60;PID=61.386,1.698;PLENGTH=0.036,1.185;R
chr1 297535 96 A <INS> 96 MapQual IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=297536;CIPOS=-1,1;CIEND=-1,1;SVLEN=68;RT=0,2,0;GAP=68;MAPQ=60,60;PID=6.857,456.896;PLENGTH=0.409,0.005;RLEN
chr1 350805 98 A <INS> 649 MapQual IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=350806;CIPOS=-1,3;CIEND=-1,3;SVLEN=40;RT=0,11,0;GAP=40;MAPQ=36,36;PID=33.595,1.201;PLENGTH=0.094,2.548;RLEN
chr1 368841 107 G <INS> 35 SVcluster;MapQual IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=368842;CIPOS=-2,1;CIEND=-2,1;SVLEN=43;RT=0,4,0;GAP=43;MAPQ=42,42;PID=1.298,27.038;PLENGTH=0
chr1 369462 133 C <INS> 131 SVcluster;MapQual;CIPOS;CIEND IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=369463;CIPOS=-14,19;CIEND=-14,19;SVLEN=72;RT=0,6,0;GAP=72;MAPQ=56,56;PID=7.453,8.49
chr1 369520 135 A <INS> 65 SVcluster;MapQual IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=369521;CIPOS=-4,4;CIEND=-4,4;SVLEN=42;RT=0,5,0;GAP=42;MAPQ=35,35;PID=23.347,20.085;PLENGTH=
Essentially a single nucleotide would be sufficient - and in one file 98% is correct. The other file has N for every position so some more changes need to be made.
I'm already working on a python script (now posted as answer) but I thought someone else should have done it already ;p
For your narrow purpose, where all your insertions are symbolic rather than explicitly spelled out, the existing answers should be a satisfactory workaround to "bcftools +fixref"'s insistence on SNP-only data.
However, it's worth noting that there is a very good reason that bcftools insists on only SNPs. It's actually impossible to fix ordinary short indels: the only difference between a 1-base deletion and an insertion of the same base at the same position is the REF/ALT order.
Thanks! That seems to work beautifully well (and fast too).
Two comments: the warn "editing $chr:$pos $ref->$obs\n"; is off-by-one and the header is not written to the new vcf file ;-)
This script is nice to identify the variants whose REF allele is not correct and should be revised. It should be not used to change the reference alllele forcely with this script.
tagging: Pierre Lindenbaum
what does it currently look like in the REF/ALT column ? (why do you need to fill the REF column (thinking of a very large deletion...) in your downstream analysis ? )
My main issue is that EVA complains that the nucleotides are incorrect. It seems their software only uses POS to check the REF nucleotide, and not END.
Just some lines from the vcf:
Essentially a single nucleotide would be sufficient - and in one file 98% is correct. The other file has
N
for every position so some more changes need to be made.I'm already working on a python script (now posted as answer) but I thought someone else should have done it already ;p
For your narrow purpose, where all your insertions are symbolic rather than explicitly spelled out, the existing answers should be a satisfactory workaround to "bcftools +fixref"'s insistence on SNP-only data.
However, it's worth noting that there is a very good reason that bcftools insists on only SNPs. It's actually impossible to fix ordinary short indels: the only difference between a 1-base deletion and an insertion of the same base at the same position is the REF/ALT order.