Hello,
I'm looking for an existing script (widely used, well tested, etc.) that can parse TAB delimited output from VEP. From a bunch of online searches, there seem to be tools, such as the bcftools plugin split-vep, Pierre's bcfr, etc., that can parse VEP's VCF output into a proper tab delimited format where columns are consistent across the dataset.
However, VEP's own tab delimited format is pretty irregular, with the "Extra" column containing a variable number of KEY-VALUE annotation pairs per variant. Is there a tool that can accept a VEP TSV file and create a structured TSV where all KEYs are in their own columns and the VALUEs in that column are either the appropriate VALUE from the KEY=VALUE entry or a "." if the KEY does not exist for that entry? For example, here are two sample input lines:
#Uploaded_variation Location Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation Extra
rs62635297 chr1:14653 T ENSG00000223972 ENST00000456328.2 Transcript downstream_gene_variant - - - - - rs62635297 IMPACT=MODIFIER;DISTANCE=244;STRAND=1;PICK=1;SYMBOL=DDX11L1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:37102;BIOTYPE=processed_transcript;CANONICAL=YES;TSL=1
rs62635297 chr1:14653 T ENSG00000227232 ENST00000488147.1 Transcript intron_variant,non_coding_transcript_variant - - - - - rs62635297 IMPACT=MODIFIER;STRAND=-1;SYMBOL=WASH7P;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:38034;BIOTYPE=unprocessed_pseudogene;CANONICAL=YES;HGVSc=ENST00000488147.1:n.1254-152G>A
What I'd like as output:
#Uploaded_variation Location Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation IMPACT DISTANCE STRAND SYMBOL SYMBOL_SOURCE PICK HGNC_ID BIOTYPE CANONICAL TSL HGVSc
rs62635297 chr1:14653 T ENSG00000223972 ENST00000456328.2 Transcript downstream_gene_variant - - - - - rs62635297 MODIFIER 244 1 DDX11L1 HGNC 1 HGNC:37102 processed_transcript YES 1 .
rs62635297 chr1:14653 T ENSG00000227232 ENST00000488147.1 Transcript intron_variant,non_coding_transcript_variant - - - - - rs62635297 MODIFIER . -1 WASH7P HGNC . HGNC:38034 unprocessed_pseudogene YES . ENST00000488147.1:n.1254-152G>A
I could use some R to write a script myself, but I thought I'd check if there's something already out there that the community uses.
Thank you for your time!
Not what you were asking, but Pierre had this in a previous thread: Is there a better tool for visualizing your variants after annotation than Excel?
Hi! I am in the same situation now :). Did you find any well-documented solution?
I think I ended up writing some bcftools code (bcftools +split-vep IIRC). I'll share the exact code tomorrow.
I wrote some complicated code. First off, the -l option in bcftools +split-vep lets me list all annotation keys, which I use to construct a format string, which I then pass to bcftools +split-vep again to get the tsv file:
Thanks for sharing it :)
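For anyone landing here with only the VEP tab delimited file (no VCF), here is a minimal Python sketch of the expansion described in the question. It is not the bcftools code mentioned above, just one possible way to do it. It assumes that "Extra" is the last column, that VEP writes "-" when Extra is empty, and that "##" lines are metadata to skip. The function name expand_extra and the demo rows (abbreviated from the question's example) are only for illustration.

```python
def expand_extra(lines, missing="."):
    """Expand the VEP 'Extra' column into one column per KEY.

    `lines` is any iterable of text lines from a VEP tab-delimited file.
    Assumption: 'Extra' is the LAST column of every line.
    Returns (header, rows); every row has the same number of fields.
    """
    header = None
    records = []  # (fixed_fields, {key: value}) for each data line
    keys = []     # Extra keys, in order of first appearance
    seen = set()

    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and '##' metadata lines
        fields = line.split("\t")
        if line.startswith("#"):
            header = fields[:-1]  # drop the trailing 'Extra' label
            continue
        extra = {}
        if fields[-1] != "-":  # assumption: '-' marks an empty Extra field
            for pair in fields[-1].split(";"):
                if not pair:
                    continue
                # Split on the first '=' only, so values may contain '='
                key, _, value = pair.partition("=")
                extra[key] = value
                if key not in seen:
                    seen.add(key)
                    keys.append(key)
        records.append((fields[:-1], extra))

    if header is None:
        raise ValueError("no '#'-prefixed column header line found")

    # Second pass: fill in `missing` wherever a key is absent from a line
    rows = [fixed + [extra.get(k, missing) for k in keys]
            for fixed, extra in records]
    return header + keys, rows


# Demo on two rows abbreviated from the question (fewer columns and keys)
demo = [
    "#Uploaded_variation\tLocation\tAllele\tExtra",
    "rs62635297\tchr1:14653\tT\tIMPACT=MODIFIER;DISTANCE=244;STRAND=1",
    "rs62635297\tchr1:14653\tT\tIMPACT=MODIFIER;STRAND=-1;SYMBOL=WASH7P",
]
demo_header, demo_rows = expand_extra(demo)
for out_row in [demo_header] + demo_rows:
    print("\t".join(out_row))
```

To run it on a real file, pass an open file handle, e.g. expand_extra(open("vep_output.tsv")), and write the header and rows back out joined by tabs. Note that the new columns come out in order of first appearance, which may differ slightly from the order in the question's desired output, and that the sketch holds the whole file in memory, so a very large VEP output would need a two-pass version that reads the file twice instead.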