Question

Parsing VEP tab-delimited output (the "Extra" column)

1

Entering edit mode

2.3 years ago

Ram 44k

Hello,

I'm looking for an existing script (widely used, well tested, etc.) that can parse TAB delimited output from VEP. From a bunch of online searches, there seems to exist tools such as bcftools plugin split-vep, Pierre's bcfr, etc that can parse VEP's VCF output to proper tab delimited format where columns are consistent across the dataset.

However, VEP's own tab delimited format is pretty irregular, with the "Extra" column containing a variable number of KEY-VALUE annotation pairs per variant. Is there a tool that can accept a VEP TSV file and create a structured TSV where all KEYs are in their own columns and the VALUEs in that column are either the appropriate VALUE from the KEY=VALUE entry or a "." if the KEY does not exist for that entry? For example, here are two sample input lines:

#Uploaded_variation  Location    Allele  Gene             Feature            Feature_type  Consequence                                   cDNA_position  CDS_position  Protein_position  Amino_acids  Codons  Existing_variation  Extra
rs62635297           chr1:14653  T       ENSG00000223972  ENST00000456328.2  Transcript    downstream_gene_variant                       -              -             -                 -            -       rs62635297          IMPACT=MODIFIER;DISTANCE=244;STRAND=1;PICK=1;SYMBOL=DDX11L1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:37102;BIOTYPE=processed_transcript;CANONICAL=YES;TSL=1
rs62635297           chr1:14653  T       ENSG00000227232  ENST00000488147.1  Transcript    intron_variant,non_coding_transcript_variant  -              -             -                 -            -       rs62635297          IMPACT=MODIFIER;STRAND=-1;SYMBOL=WASH7P;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:38034;BIOTYPE=unprocessed_pseudogene;CANONICAL=YES;HGVSc=ENST00000488147.1:n.1254-152G>A

What I'd like as output:

#Uploaded_variation  Location    Allele  Gene             Feature            Feature_type  Consequence                                   cDNA_position  CDS_position  Protein_position  Amino_acids  Codons  Existing_variation  IMPACT    DISTANCE  STRAND  SYMBOL   SYMBOL_SOURCE  PICK  HGNC_ID     BIOTYPE                 CANONICAL  TSL  HGVSc
rs62635297           chr1:14653  T       ENSG00000223972  ENST00000456328.2  Transcript    downstream_gene_variant                       -              -             -                 -            -       rs62635297          MODIFIER  244       1       DDX11L1  HGNC           1     HGNC:37102  processed_transcript    YES        1    .
rs62635297           chr1:14653  T       ENSG00000227232  ENST00000488147.1  Transcript    intron_variant,non_coding_transcript_variant  -              -             -                 -            -       rs62635297          MODIFIER  .         -1      WASH7P   HGNC           .     HGNC:38034  unprocessed_pseudogene  YES        .    ENST00000488147.1:n.1254-152G>A

I could use some R to write a script myself, but I thought I'd check if there's something already out there that the community uses.

Thank you for your time!

tab vep • 2.1k views

ADD COMMENT • link updated 10 months ago by iraun 6.2k • written 2.3 years ago by Ram 44k

0

Entering edit mode

Not what you were asking but Pierre had this in a previous thread: Is there a better tool for visualizing your variants after annotation than Excel?

ADD REPLY • link 2.3 years ago by GenoMax 147k

0

Entering edit mode

Hi! I am in the same situation now :). Did you find any well-documented solution?

ADD REPLY • link 11 months ago by iraun 6.2k

0

Entering edit mode

I think I ended up writing some bcftools code (bcftools +split-vep IIRC). I'll share the exact code tomorrow.

ADD REPLY • link 11 months ago by Ram 44k

1

Entering edit mode

I wrote some complicated code. First off, the -l option in bcftools +split-vep lets me list all annotation keys, which I use to construct a format string, which I then pass to bcftools +split-vep again to get the tsv file:

bcftools +split-vep $vcf_file -l

0   Allele
1   Consequence
2   IMPACT
3   SYMBOL
4   Gene
5   Feature_type
6   Feature
7   BIOTYPE
8   EXON
9   INTRON
10  HGVSc
11  HGVSp
12  cDNA_position
13  CDS_position
..
..

bcftools +split-vep $vcf_file -l | cut  -f2 | tr '\n' '\t' | sed -Ee 's/\t$/\n/;s/^/%/;s/\t/\t%/g'
%Allele %Consequence    %IMPACT %SYMBOL %Gene   %Feature_type   %Feature    %BIOTYPE    %EXON   %INTRON %HGVSc  %HGVSp  %cDNA_position  %CDS_position   ..  ..  ..

bcftools +split-vep -f "%CHROM\t%POS\t%REF\t%ALT\t%FILTER\t%ID\t$(bcftools +split-vep ${vcf_f ile} -l | cut  -f2 | tr '\n' '\t' | sed -Ee 's/\t$/\n/;s/^/%/;s/\t/\t%/g')\n" -d ${vcf_file} > ${output_name}.all.tsv

ADD REPLY • link 11 months ago by Ram 44k

0

Entering edit mode

Thanks for sharing it :)

ADD REPLY • link 10 months ago by iraun 6.2k

score 1 · Answer 1 · 2022-08-15

using awk and datamash

$ awk -F '\t' '(NR==1){split($0,header);next;} {K=sprintf("%s:%s:%s:%s",$2,$1,$3,$5);for(i=1;i<=13;i++) {printf("%s\t%s\t%s\n",K,header[i],$i);}N=split($14,A,/[;]/);for(i=1;i<=N;i++) {M=split(A[i],B,/[=]/);printf("%s\t%s\t%s\n",K,B[1],(M==2?B[2]:"."));}}' < input.txt  | sort -t $'\t' -k1,1 -k2,2 | datamash crosstab 1,2 unique 3  
#Uploaded_variation                        Allele      Amino_acids  BIOTYPE  CANONICAL               CDS_position  Codons  Consequence  DISTANCE                                      Existing_variation  Feature     Feature_type       Gene        HGNC_ID          HGVSc       IMPACT                           Location  PICK        Protein_position  STRAND  SYMBOL  SYMBOL_SOURCE  TSL   cDNA_position
chr1:14653:rs62635297:T:ENST00000456328.2  rs62635297  T            -        processed_transcript    YES           -       -            downstream_gene_variant                       244                 rs62635297  ENST00000456328.2  Transcript  ENSG00000223972  HGNC:37102  N/A                              MODIFIER  chr1:14653  1                 -       1       DDX11L1        HGNC  1              -
chr1:14653:rs62635297:T:ENST00000488147.1  rs62635297  T            -        unprocessed_pseudogene  YES           -       -            intron_variant,non_coding_transcript_variant  N/A                 rs62635297  ENST00000488147.1  Transcript  ENSG00000227232  HGNC:38034  ENST00000488147.1:n.1254-152G>A  MODIFIER  chr1:14653  N/A               -       -1      WASH7P         HGNC  N/A            -