Error: htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 5880: Duplicate allele added to VariantContext: GT
3.8 years ago

I am trying to index a VCF file using igvtools, but I get the following error:

Error: htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 5880: Duplicate allele added to VariantContext: GT

When I go to that line, the reference allele (REF) is duplicated in the ALT column. Here is what it looks like:

1   19723050    rs9004957   GT  G,GT    .   .   RSPOS=19617712;RV;dbSNPBuildID=118;SAO=0;VC=in-del;VLD;VP=050000000005000100000200

When I edit the VCF and remove the extra GT from the ALT column, I get the same error again, just thousands of lines later in the file. If this happened only a couple of times I would fix the lines by hand, but there are too many occurrences for that. Is there a way to fix all of them at once?
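
Before rewriting anything, it may help to see how many records are affected. The following is a minimal sketch, assuming a tab-delimited VCF with REF in column 4 and ALT in column 5; the function name list_ref_in_alt is made up for illustration. It prints the line number, CHROM, POS, REF and ALT of every record whose ALT list contains the REF allele.

```shell
# Sketch: list records whose ALT column repeats the REF allele.
# list_ref_in_alt is a hypothetical helper name, not part of any tool.
list_ref_in_alt() {
  awk 'BEGIN { FS = "\t" }
       /^#/ { next }                       # ignore header lines
       {
         n = split($5, alt, ",")           # ALT may hold several alleles
         for (i = 1; i <= n; i++)
           if (alt[i] == $4) {             # this ALT allele equals REF
             print NR ": " $1 "\t" $2 "\t" $4 "\t" $5
             break                         # report each record only once
           }
       }' "$@"
}

# Usage (reads the named file, or stdin if no file is given):
#   list_ref_in_alt input.vcf | wc -l
```

Piping the output to wc -l gives the number of affected records; the line numbers count header lines too, so they should line up with the approximate line number in the error message.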

SNP genome next-gen Assembly • 3.7k views
3.8 years ago
awk -F '\t' '/^#/ {print;next;} {OFS="\t";R=$4;n=split($5,a,/[,]/);s="";for(i=1;i<=n;i++) {s=sprintf("%s%s%s%s",s,(i==1?"":","),a[i],a[i]==R?"AAAAAAAAA":"");} $5=s; print;}' < input.vcf
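
Note that the one-liner above keeps the duplicated allele but makes it distinct by appending AAAAAAAAA to it. If you would rather drop the duplicate, a variant along these lines might work. It is only a sketch: it assumes a sites-only VCF like the dbSNP line in the question (no genotype columns, so no GT indices to renumber), it does not touch per-allele INFO fields, and the function name drop_ref_dup_alts is made up for illustration.

```shell
# Sketch: drop ALT alleles identical to REF instead of renaming them.
# Assumes a sites-only VCF (no sample columns); hypothetical helper name.
drop_ref_dup_alts() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ { print; next }                # pass header lines through unchanged
       {
         n = split($5, alt, ",")
         kept = ""
         for (i = 1; i <= n; i++) {
           if (alt[i] == $4) continue      # skip ALT alleles equal to REF
           kept = (kept == "" ? alt[i] : kept "," alt[i])
         }
         $5 = (kept == "" ? "." : kept)    # "." marks a record with no ALT left
         print
       }' "$@"
}

# Usage:
#   drop_ref_dup_alts old.vcf > new.vcf
```

With the line from the question, the ALT column G,GT becomes G. Duplicates among the ALT alleles themselves (for example G,G) are not handled here.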

That worked like a charm. I changed it a bit to write the result to a new file. Here is what I did, for anyone else who encounters this error:

awk -F '\t' '/^#/ {print;next;} {OFS="\t";R=$4;n=split($5,a,/[,]/);s="";for(i=1;i<=n;i++) {s=sprintf("%s%s%s%s",s,(i==1?"":","),a[i],a[i]==R?"AAAAAAAAA":"");} $5=s; print;}' old.vcf > new.vcf

So the only change is that you added old.vcf > new.vcf to the command.

3.0 years ago
Sam • 0

It's easier to use vcftools. Note that this removes every indel site from the file, not just the malformed records:

vcftools --remove-indels --recode --recode-INFO-all --vcf old.vcf --stdout > new.snp.vcf

15 months ago

I had a similar problem using GATK BaseRecalibrator and Ensembl v103 VCF files for --known-sites. I initially thought that it was a version problem but even using the latest version for GRCh38.p13 (v109) I still had the same issue.

Errors: A USER ERROR has occurred: Error while trying to create index for homo_sapiens-chr3.vcf. Error was: htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 4431575: Symbolic alleles not allowed as reference allele: <W>

A USER ERROR has occurred: Error while trying to create index for homo_sapiens-chr17.vcf. Error was: htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 19739670: Symbolic alleles not allowed as reference allele: <Y>

So I simply removed every record with a symbolic reference allele, i.e. a REF field containing < or >. This is not recommended if you are working with CNVs or structural variants:

input_vcf="homo_sapiens-chr17.vcf"
output_vcf="homo_sapiens-chr17_wo_symbolic_refs.vcf"

awk 'BEGIN {FS=OFS="\t"} /^#/ {print; next} $4 !~ /[<>]/ {print}' "$input_vcf" > "$output_vcf"

Apparently, GATK does not accept symbolic alleles in the REF column. After this change, it worked. P.S. For some reason, only chr3 and chr17 had this problem.
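
To check how many records the filter would throw away, and on which chromosomes, something like the following could be run first. This is a rough sketch that uses the same test on column 4 as the command above; the function name count_symbolic_refs is made up for illustration.

```shell
# Sketch: count records with a symbolic REF allele, per chromosome.
# count_symbolic_refs is a hypothetical helper name, not part of GATK.
count_symbolic_refs() {
  awk 'BEGIN { FS = "\t" }
       /^#/ { next }                       # ignore header lines
       $4 ~ /[<>]/ { n[$1]++ }             # REF contains < or >
       END { for (c in n) print c "\t" n[c] }' "$@"
}

# Usage:
#   count_symbolic_refs homo_sapiens-chr17.vcf
```

Each output line is a chromosome name and the number of records that would be removed. The order of the lines is not guaranteed when several chromosomes are present.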
