Question

Ensembl VEP filtering most_severe keeping Symbol, Feature,Consequence etc.

2

Entering edit mode

10.0 years ago

t2 ▴ 60

Hi all,

What I need to do is filter a file produced using non-stringent Variant Effect Predictor (VEP) settings with one that was produced with more stringent VEP settings.

I've been running VEP locally using the cache option with a pre-built cache with this command on my vcfs:

perl $VEP \
  --cache \
  --dir $VEP_DIR \
  --offline \
  --input_file $input \
  --output_file $output \
  --sift b \
  --polyphen b \
  --regulatory \
  --protein \
  --symbol \
  --ccds \
  --uniprot \
  --check_existing \
  --gmaf \
  --maf_1kg \
  --maf_esp \
  --pubmed

Everything works great and I'm super happy with the documentation. However, I realized after I had run my command on all my exomes that I would most likely get many entries for each particular variant depending on different Ensembl Feature IDs.

VEP has a fix for this, which is to use the --most_severe flag when running the command. That works perfectly, however, some extra flags are disabled when using the --most_severe flag. I would like to retain this extra information (like gene name/symbol Feature,Consequence, etc.) for the variants produced with the --most_severe flag.

perl $VEP \
  --cache \
  --dir $VEP_DIR \
  --offline \
  --input_file $input \
  --output_file $output \
  --regulatory \
  --uniprot \
  --check_existing \
  --gmaf \
  --maf_1kg \
  --maf_esp \
  --most_severe

So now I have two files for each vcf; 1) disabled --most_severe and 2) --most_severe. The 2nd file is basically a subset of the 1st file but with some important missing information.

In the 1st file when there are multiple entries for a variant, most of the fields are the same except the Feature_type field and often the Extra field.

Both produce a tab delimited text file with columns such as this:

#Uploaded_variation    Location    Allele    Gene    Feature    Feature_type    Consequence    cDNA_position    CDS_position    Protein_position    Amino_acids    Codons    Existing_variation    Extra

Is there a way to filter the 1st file with the 2nd file. I think I need to use fields Uploaded_variation and Consequence for matching the 1st file because those are the fields that are unique in the line.

I think using awk to search for columns in both files won't work because there is some information lost in the Consequence field in the 2nd file

For example a variant Consequence may change from:

non_coding_transcript_exon_variant,non_coding_transcript_variant

to

non_coding_transcript_exon_variant

I appreciate any help in solving this issue. Alternatively there is a filter_vep script provided by VEP for post-VEP annotation filtering but I don't think there is an option here that will solve my problem.

Thanks,
Tesa

VEP next-gen-sequencing ensembl • 6.4k views

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by t2 ▴ 60

0

Entering edit mode

Emily please help him/her.

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by Ashutosh Pandey 12k

Ram · Accepted Answer · 2014-11-17

You will encounter problems whichever way you try to combine these two files I'm afraid. For example, let's say your small (--most_severe on) file has a line with a consequence of missense_variant. Then in your large file, there are three corresponding lines of output for that variant with a consequence of missense_variant - how do you decide which to choose?

Also, the consequence type picked by --most_severe may be calculated relative to an Ensembl feature that does not have reliable biological evidence to support it - do you still want to choose this one over any others?

Is it practical you for you re-run your analyses? There are a couple of other options that you might find useful:

a) --pick: this chooses one line of consequence data (with all the fields retained) for each variant. It uses the following criteria to pick one:

1) is the transcript canonical 2) is the transcript biotype protein_coding 3) consequence rank 4) transcript length

In the forthcoming version of VEP you will be able to customise this order.

http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html#opt_pick

--pick_allele chooses one line per variant allele (i.e. this will come into effect when the input variant has more than one alternate allele)

b) --flag_pick: like --pick but just adds a flag to the line chosen by the same rules

c) --per_gene: like --pick but chooses one line per variant/gene combination

As a footnote, we always try to discourage people from using these summary flags if we can - there will always be cases where valuable data gets lost, and you are relying on an arbitrary and subjective algorithm to perform that summarising. The logic of this algorithm will always be wrong for some use case no matter how we code it. Thus by keeping all the data you are ensuring you don't miss anything.