htsjdk.tribble.TribbleException: The provided VCF file is malformed
3.0 years ago
Egelbets ▴ 30

I have VCF files that I want to convert to more readable TSV files using GATK VariantsToTable, and I also want to load the VCFs into IGV. However, both operations fail with the same error:

htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 764: unparsable vcf record with allele Y

VCF at line 764:

Malacosoma  119981  .   Y   C   16388   PASS    DP=1374;AF=0.908297;SB=0;DP4=0,0,537,711

The REF is Y and the ALT is C. It seems that whatever this htsjdk.tribble is, it can't work with IUPAC nucleotide codes (I also get this error with other IUPAC codes in other VCF files). Does anyone know a workaround for this?

VCF tribble GATK igv • 1.8k views
3.0 years ago

It seems that whatever this htsjdk.tribble is

It is a package in the htsjdk library: https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/tribble/package-summary.html

Does anyone know a workaround for this?

awk 'BEGIN {FS=OFS="\t"} /^#/ {print; next} {if ($4=="Y") $4="N"; print}' input.vcf
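Since the question mentions other IUPAC codes as well, the same idea can be generalized. The following is only a sketch, assuming that masking every REF base outside A, C, G, T, and N with N is acceptable for your analysis (the VCF specification restricts REF to those bases). The function name mask_iupac_ref is hypothetical, and symbolic alleles such as <DEL> are deliberately left untouched.

```shell
# Hypothetical helper (not part of GATK or htsjdk): mask ambiguous REF bases.
# Reads a VCF on stdin and writes the masked VCF to stdout.
mask_iupac_ref() {
    awk 'BEGIN { FS = OFS = "\t" }
         # Header lines pass through unchanged.
         /^#/ { print; next }
         # Leave symbolic alleles alone; mask any other non-ACGTN base with N.
         $4 !~ /[<>]/ { gsub(/[^ACGTNacgtn]/, "N", $4) }
         { print }'
}
```

Usage would look like `mask_iupac_ref < input.vcf > masked.vcf`, where both file names are placeholders. Whether downstream tools accept the masked records against your reference is something to check on your own data.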
15 months ago

I had a similar problem using GATK BaseRecalibrator and Ensembl v103 VCF files for --known-sites. I initially thought that it was a version problem but even using the latest version for GRCh38.p13 (v109) I still had the same issue.

Errors: A USER ERROR has occurred: Error while trying to create index for homo_sapiens-chr3.vcf. Error was: htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 4431575: Symbolic alleles not allowed as reference allele: <W>

A USER ERROR has occurred: Error while trying to create index for homo_sapiens-chr17.vcf. Error was: htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 19739670: Symbolic alleles not allowed as reference allele: <Y>

So I removed all records whose REF is a symbolic allele, that is, one enclosed in <>. However, this is not recommended if you are working with CNVs or structural variants:

input_vcf="homo_sapiens-chr17.vcf"
output_vcf="homo_sapiens-chr17_wo_symbolic_refs.vcf"

# Keep header lines; drop records whose REF (column 4) contains < or >
awk 'BEGIN {FS=OFS="\t"} /^#/ {print; next} $4 !~ /[<>]/ {print}' "$input_vcf" > "$output_vcf"

Apparently, GATK does not accept a symbolic allele in the REF column. After this change, it worked. P.S. For some reason, only chr3 and chr17 had this problem.
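Before dropping records, it may help to see how many would be affected and which REF values are involved. Here is a minimal sketch, assuming a plain-text (uncompressed) VCF on stdin; the function name count_odd_refs is hypothetical. It tallies every REF allele that contains anything other than A, C, G, T, or N, which covers both symbolic alleles and IUPAC codes.

```shell
# Hypothetical helper: tally REF alleles that contain non-ACGTN characters.
# Reads a VCF on stdin; prints "REF<TAB>count", sorted by REF.
count_odd_refs() {
    awk 'BEGIN { FS = OFS = "\t" }
         # Skip header lines.
         /^#/ { next }
         # Count each distinct offending REF value.
         $4 ~ /[^ACGTNacgtn]/ { n[$4]++ }
         END { for (r in n) print r, n[r] }' | LC_ALL=C sort
}
```

For example, `count_odd_refs < homo_sapiens-chr17.vcf` would list each offending REF value with its count. Empty output suggests the file has no such records.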
