htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 92283: Duplicate allele added to VariantContext: C, for input source:
3.7 years ago
DareDevil ★ 4.3k

I ran the VariantsToTable tool from GATK as follows:

gatk VariantsToTable -V 001.vcf -F CHROM -F POS -F ID -F REF -F ALT -F QUAL -F FILTER -F INFO -F FORMAT -F Father -F Mother -F Child -F ADP -F STATUS -F CSQ  -GF GT -GF GQ -GF SDP -GF DP -GF RD -GF AD -GF FREQ -GF PVAL -GF RBQ -GF ABQ -GF RDF -GF RDR -GF ADF -GF ADR -O 001.table

It failed with the following error:

htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 92283: Duplicate allele added to VariantContext: C, for input source: 001.vcf

I checked what is on that line:

sed -n 92283p 001.vcf

The output was:

chr1    35816798    .   CAAAAAAAAAAAAA  C,C .   PASS    ADP=20;STATUS=2;CSQ=-|intron_variant|MODIFIER|AGO4|ENSG00000134698|Transcript|ENST00000373210|protein_coding||1/17||||||||||1||HGNC|HGNC:18424|YES|CCDS397.1||  GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR    0/1:51:22:22:7:13:59.09%:6.4422E-6:38:47:6:1:13:0   0/2:20:27:27:13:6:24%:9.828E-3:38:39:13:0:6:0   0/2:22:13:13:5:6:46.15%:6.192E-3:41:38:5:0:6:0

To reformat the VCF file I ran:

awk -F '\t' '/^#/ {print;next;} {OFS="\t";R=$4;n=split($5,a,/[,]/);s="";for(i=1;i<=n;i++) {s=sprintf("%s%s%s%s",s,(i==1?"":","),a[i],a[i]==R?"AAAAAAAAA":"");} $5=s; print;}' 001.vcf > 001new.vcf

I then ran VariantsToTable again on 001new.vcf, but it still shows the same error. When I removed that particular line from the VCF, the error moved to another line. What I noticed is that the ALT column lists the same allele twice (here C,C). Any help is appreciated.

VariantsToTable vcf gatk • 2.4k views
3.7 years ago
David Parry ▴ 150

I don't think your awk code is doing what you intend. Below is a perl one-liner that should fix lines like the above where there are 2 identical ALT alleles:

perl -wane 'if (m/^#/){ print; next;} my @alts = split(",", $F[4]); if (@alts == 2 and $alts[0] eq $alts[1]){ $F[4] = $alts[0]; foreach my $gt (@F[9..$#F]){ my @formats = split(":", $gt); $formats[0] =~ s/2/1/g; $gt = join(":", @formats); } } print join("\t", @F) . "\n";'  001.vcf > 001new.vcf

However, this fix is a total hack and only applies to variants with 2 identical ALT alleles - it will not handle situations with more than 2 ALT alleles even if 2 or more of them are identical. The fact that you have lines like this in your input suggests something is going wrong when creating your VCF so I would strongly recommend re-examining how this is happening and if at all possible use a different tool to generate your VCF.
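For the more general case (any number of ALT alleles, any of them repeated), a minimal Python sketch could look like the following. It is not part of the one-liner above: the function name `dedupe_alts` is made up for illustration, and it assumes that only the GT field refers to alleles by index. That appears to hold for the record shown in the question, where AD and similar fields carry a single value per sample, but it would not hold for a VCF with per-allele fields, which would need merging as well.

```python
import re
import sys


def dedupe_alts(record: str) -> str:
    """Collapse repeated ALT alleles in one VCF record (no trailing newline)
    and renumber the allele indices used in each sample's GT field."""
    if record.startswith("#"):
        return record  # header lines pass through untouched
    fields = record.split("\t")
    if len(fields) < 5:
        return record
    alts = fields[4].split(",")
    unique = []          # ALT alleles in first-seen order
    remap = {"0": "0"}   # old allele index -> new allele index (REF stays 0)
    for old_idx, alt in enumerate(alts, start=1):
        if alt not in unique:
            unique.append(alt)
        remap[str(old_idx)] = str(unique.index(alt) + 1)
    if len(unique) == len(alts):
        return record  # no duplicates, nothing to change
    fields[4] = ",".join(unique)
    if len(fields) > 9:
        keys = fields[8].split(":")
        if "GT" in keys:
            gt_pos = keys.index("GT")
            for i in range(9, len(fields)):
                parts = fields[i].split(":")
                if gt_pos >= len(parts):
                    continue  # sample column is truncated, leave it alone
                # Split on / or | but keep the separators, then map each index.
                tokens = re.split(r"([/|])", parts[gt_pos])
                parts[gt_pos] = "".join(remap.get(t, t) for t in tokens)
                fields[i] = ":".join(parts)
    return "\t".join(fields)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Assumed usage: python dedupe_alts.py in.vcf out.vcf
    with open(sys.argv[1]) as src, open(sys.argv[2], "w") as dst:
        for line in src:
            dst.write(dedupe_alts(line.rstrip("\n")) + "\n")
```

Saved under a hypothetical name such as dedupe_alts.py, it would be run as `python dedupe_alts.py 001.vcf 001new.vcf`. Like the perl one-liner, this only papers over the problem; the caveat about fixing whatever generated the VCF still applies.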

I tried this, but I still get the same error.
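One way to narrow down why the error persists is to list every record whose ALT column is still problematic after the rewrite. The sketch below is only a diagnostic, with made-up function names (`alt_problems`, `scan`). It flags ALT alleles that are repeated and, on the assumption that the parser treats it the same way, an ALT that is identical to REF. Alleles are compared case-insensitively, which is also an assumption about how the parser behaves.

```python
import sys
from collections import Counter


def alt_problems(record):
    """Return a list of human-readable problems found in one VCF record."""
    fields = record.rstrip("\n").split("\t")
    if record.startswith("#") or len(fields) < 5:
        return []
    ref = fields[3].upper()
    alts = [a.upper() for a in fields[4].split(",")]
    problems = []
    for allele, count in Counter(alts).items():
        if count > 1:
            problems.append(f"ALT {allele} listed {count} times")
    if ref in alts:
        problems.append(f"ALT equals REF ({ref})")
    return problems


def scan(lines):
    """Yield (line_number, problem) pairs, counting header lines too,
    so the numbers can be compared with the one in the error message."""
    for line_no, line in enumerate(lines, start=1):
        for problem in alt_problems(line):
            yield line_no, problem


if __name__ == "__main__" and len(sys.argv) == 2:
    # Assumed usage: python check_alts.py 001new.vcf
    with open(sys.argv[1]) as handle:
        for line_no, problem in scan(handle):
            print(f"line {line_no}: {problem}")
```

If this prints nothing for the rewritten file but the tool still fails, the remaining problem is probably something other than the ALT column of that file, for example the command still pointing at the original 001.vcf.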
