Entering edit mode
11.0 years ago
Allpowerde
★
1.3k
It seems that for some variants (predominantly RPLs) pindel2vcf records the wrong END-value (which subsequently GATK complains about "ERROR MESSAGE: BUG: GenomeLoc 9:17427298-17427314 has a size == 18 but the variation reference allele has length 17") e.g.
chr9 17427298 . TAGATTTTTCAGCAATAC AGATTTTTCAGCAATACA . PASS END=17427314;HOMLEN=0;NTLEN=18;SVLEN=-18;SVTYPE=RPL GT:AD 0/0:0,4
The variant starts at 17427298 and affect a stretch of 18nt (ref allele) so the end should be 17427298+18-1=17427315 not 17427314.
Why does this occur?
In the meantime: this python snippet will correct the END values
import sys
filename=sys.argv[1]
outfile=open(sys.argv[2],"w")
countall=0
count=0
for i in open(filename):
if i.find("#")>-1:
outfile.write(i)
else:
countall+=1
content=i.split("\t")
start=int(content[1])
end=int(i.split("END=")[1].split(";")[0])
length=len(content[3])
if (start+length-1!=end):
count+=1
#print "update length %i %i : %s" % (start+length-1, end, i)
content[7]=content[7].replace("END="+str(end), "END="+str(start+length-1))
outfile.write("\t".join(content))
print "%i updated out of %i" % (count, countall)