Repeated information on specific parameters of the INFO field - VCF files - EBIvariation vcf-validator
6.5 years ago
daianagan ▴ 40

Hello everyone! I am new to working with VCF files, and the EBIvariation/vcf-validator was recommended to me to check that a file is correctly formatted. My variant calling (I don't run it myself; it's the output of a service we pay for) produced a VCF file with many repeated values in its INFO field, for example:

AA=p.K2811fs*46,p.K2811fs*46; CDS=c.8426delA,c.8426delA; CNT=1,1

Apparently, having "p.K2811fs*46" twice is not valid, so I should keep only one.

I haven't found a tool that does this yet (I'm not sure one even exists), so any help is very welcome!!!

vcf next-gen vcf-validator • 2.3k views

Hello daianagan,

Could you please post the complete header of the VCF file and the first 5-10 variants?

fin swimmer


Sorry, I didn't realize I was answering as a new comment


Hello Fin! Thanks for your reply, here is what you've asked for. I've attached it, since the format when copying here was a mess.


Hello daianagan,

I could not find any repeated information in your example VCF. Am I overlooking something? If not every entry has repeated information, please add some examples that do.

fin swimmer


So sorry about that. It's updated now, the last one has, among others, the AA info duplicated. Thank you!


This is a VEP-annotated VCF, and this example VCF doesn't have the OP entries.


Hi cpad! Thank you for your reply. If it is not too much to ask, could you briefly explain what a VEP-annotated VCF is? Why would this cause any trouble? Also, what are OP entries? Thank you!!!


The original VCF was functionally annotated with VEP, as the tags in the OP (original post) are in line with VEP output. The duplicate entries you posted at the start are not present in the VCF file you shared. Apparently, the duplicated p. syntax might be due to multiple transcripts being affected by that variant. One needs to be careful before annotating the output.

6.5 years ago

Hello daianagan,

The problem with your VCF is not just that some INFO fields contain duplicate values; the header also declares that each of these fields holds only one value.

##INFO=<ID=CDS,Number=1,Type=String,Description="CDS annotation">
##INFO=<ID=AA,Number=1,Type=String,Description="Peptide annotation">
##INFO=<ID=GENE,Number=1,Type=String,Description="Gene name">
##INFO=<ID=CNT,Number=1,Type=Integer,Description="How many samples have this mutation">
##INFO=<ID=STRAND,Number=1,Type=String,Description="Gene strand">

The "Number" attribute defines how many values a field may hold. For more information, see the manual.

In your example there are not just duplicates. Look at this:

GENE=ATM_ENST00000278616,ATM

This entry has two different values, but only one is allowed.
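
To see which fields in a record break their declared Number, a check along these lines could help. It is only a sketch: it uses plain Python string handling rather than a VCF library, the INFO string is a hypothetical one assembled from values mentioned in this thread (STRAND=+ is made up), and declared_numbers and find_violations are illustrative names.

```python
import re

# INFO definitions copied from the header shown above.
HEADER = [
    '##INFO=<ID=CDS,Number=1,Type=String,Description="CDS annotation">',
    '##INFO=<ID=AA,Number=1,Type=String,Description="Peptide annotation">',
    '##INFO=<ID=GENE,Number=1,Type=String,Description="Gene name">',
    '##INFO=<ID=CNT,Number=1,Type=Integer,Description="How many samples have this mutation">',
    '##INFO=<ID=STRAND,Number=1,Type=String,Description="Gene strand">',
]

# Assumes ID comes first and Number second, as in the header above.
INFO_DEF = re.compile(r'^##INFO=<ID=([^,]+),Number=([^,]+),')


def declared_numbers(header_lines):
    """Map each INFO ID to its declared Number (kept as a string: it may be A, R, G or .)."""
    numbers = {}
    for line in header_lines:
        match = INFO_DEF.match(line)
        if match:
            numbers[match.group(1)] = match.group(2)
    return numbers


def find_violations(info, numbers):
    """Return {key: values} for INFO entries with more values than a fixed integer Number allows."""
    bad = {}
    for entry in info.split(';'):
        key, sep, value = entry.partition('=')
        if not sep:
            continue  # flag-style entry without a value
        number = numbers.get(key, '.')
        values = value.split(',')
        if number.isdigit() and len(values) > int(number):
            bad[key] = values
    return bad


numbers = declared_numbers(HEADER)
# Hypothetical INFO column built from the values quoted in this thread.
info = 'GENE=ATM_ENST00000278616,ATM;AA=p.K2811fs*46,p.K2811fs*46;CNT=1,1;STRAND=+'
violations = find_violations(info, numbers)
for key, values in violations.items():
    print(key, values)
```

Here GENE, AA and CNT are reported because each holds two values while the header allows one; STRAND passes.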

Here's a little Python script which iterates over all records in your VCF and truncates all INFO fields to the number given in the header.

Save the code as fixDuplicates.py and run it like this:

$ python fixDuplicates.py prueba.vcf > prueba_corrected.vcf

The script makes use of pysam. You have to install this package first.
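
The truncation idea can also be sketched without any dependencies. The following is not the pysam-based script referred to above, only a minimal standard-library illustration of the same approach. It assumes INFO definitions start with ##INFO=<ID=...,Number=... and that only fixed integer Number values (not A, R, G or .) should be enforced. The function names and the demo record (chromosome, position, alleles) are hypothetical.

```python
import re
import sys

# Assumes ID comes first and Number second in each INFO definition.
INFO_DEF = re.compile(r'^##INFO=<ID=([^,]+),Number=([^,]+),')


def truncate_info(lines):
    """Yield VCF lines with each INFO field cut to the Number declared in the header."""
    numbers = {}
    for line in lines:
        line = line.rstrip('\n')
        if line.startswith('#'):
            match = INFO_DEF.match(line)
            if match:
                numbers[match.group(1)] = match.group(2)
            yield line
            continue
        cols = line.split('\t')
        if len(cols) < 8:
            yield line  # not a full record; pass through unchanged
            continue
        fixed = []
        for entry in cols[7].split(';'):
            key, sep, value = entry.partition('=')
            number = numbers.get(key, '.')
            if sep and number.isdigit() and int(number) > 0:
                # Keep only the first N values; the rest are dropped.
                entry = key + '=' + ','.join(value.split(',')[:int(number)])
            fixed.append(entry)
        cols[7] = ';'.join(fixed)
        yield '\t'.join(cols)


def fix_file(path, out=sys.stdout):
    """Hypothetical helper: stream a VCF file through truncate_info and write to `out`."""
    with open(path) as handle:
        for line in truncate_info(handle):
            out.write(line + '\n')


# Small in-memory demo; the record's coordinates and alleles are made up.
demo = [
    '##fileformat=VCFv4.2',
    '##INFO=<ID=AA,Number=1,Type=String,Description="Peptide annotation">',
    '##INFO=<ID=GENE,Number=1,Type=String,Description="Gene name">',
    '##INFO=<ID=CNT,Number=1,Type=Integer,Description="How many samples have this mutation">',
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO',
    'chr11\t100\t.\tCA\tC\t.\t.\tGENE=ATM_ENST00000278616,ATM;AA=p.K2811fs*46,p.K2811fs*46;CNT=1,1',
]
fixed = list(truncate_info(demo))
print(fixed[-1].split('\t')[7])
# prints: GENE=ATM_ENST00000278616;AA=p.K2811fs*46;CNT=1
```

Truncation keeps the first value and drops the rest, so GENE=ATM_ENST00000278616,ATM becomes GENE=ATM_ENST00000278616. Whether dropping values is acceptable depends on what the annotation means, so it is worth checking the result before relying on it.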

fin swimmer
