Question

VEP is very slow; --fork doesn't seem to work

1

4.4 years ago

nhaus ▴ 420

I am using VEP (v103) to annotate a small VCF file (~1000 variants). Nonetheless, it takes very long (>20 minutes), which doesn't quite match their description of:

Set up correctly, VEP is capable of processing around 3 million variants in 30 minutes

Furthermore, --fork does not seem to work, because only one core is used the whole time.

This is the command that I used:

vep  --cache --dir_cache vep-cache --offline --fasta ref-genome.fa --pick --fork 4 --sift b --variant_class -i somatic.filtered.snp.vcf -o snp_vep_out.txt

I'd be very thankful if someone could point out what I am doing wrong.

vep annotation • 4.7k views

ADD COMMENT • link 4.2 years ago by nhaus ▴ 420

0

I'm not a VEP user, but if you can't figure it out then you can always use another variant annotator like OpenCRAVAT. My experience is that it should only take several seconds to annotate 1000 variants (docs here: https://open-cravat.readthedocs.io/en/latest/ ).

ADD REPLY • link 4.4 years ago by Collin ▴ 1000

0

Also, as it looks like you are trying to annotate somatic mutations (likely in cancer), OpenCRAVAT has more options for predicting oncogenic mutations in cancer beyond SIFT. Recent benchmarks suggest there are many better methods for cancer (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-01954-z ).

ADD REPLY • link 4.4 years ago by Collin ▴ 1000

score 2 · Answer 1 · 2021-03-01

2

4.4 years ago

Emily 24k

I assume you've already seen this documentation page based on the quote at the top. The 3M in 30 min is really the absolute fastest under ideal conditions, which means no additional flags (--pick and --sift in your command will make it slower).

Please check that your input file is sorted and that you've tabix-indexed your cache.
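Both checks can be scripted. Here is a minimal sketch, not an official VEP tool: the function names are made up for illustration, the VCF is assumed to be plain text (not gzipped), and the cache check assumes that a tabix-converted cache stores its variants in files named all_vars.gz, which is worth confirming against your own cache directory.

```shell
# Sketch only: helper names are hypothetical, paths are placeholders.

# Returns 0 if the VCF body is sorted: each chromosome forms one
# contiguous block and positions never decrease within a block.
vcf_is_sorted() {
    awk -F'\t' '
        BEGIN { bad = 0 }
        /^#/ { next }                      # skip header lines
        $1 != chrom {
            if ($1 in seen) { bad = 1; exit }   # chromosome seen before
            seen[$1] = 1; chrom = $1; last = 0
        }
        ($2 + 0) < last { bad = 1; exit }  # position went backwards
        { last = $2 + 0 }
        END { exit bad }
    ' "$1"
}

# Returns 0 if at least one all_vars.gz file exists under the cache
# directory. Assumption: converted (indexed) caches use this file name.
cache_looks_indexed() {
    [ -n "$(find "$1" -type f -name 'all_vars.gz' 2>/dev/null | head -n 1)" ]
}

# Example usage (not run here):
#   vcf_is_sorted somatic.filtered.snp.vcf && echo "VCF is sorted"
#   cache_looks_indexed vep-cache && echo "cache looks indexed"
```

If the second check fails, the cache probably still needs converting with convert_cache.pl.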

With regard to forking, VEP automatically reads 5000 variants into memory in each fork, so there will be no forking if you have fewer than 5000 variants. You can change this with --buffer_size, but I doubt this would increase speed much.
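To make the buffer arithmetic concrete, here is a rough sketch that counts the variants in a plain-text VCF and picks a --buffer_size small enough that every fork gets some work. The helper names and the "variants divided by forks, capped at 5000" rule are my own assumptions for illustration, not a VEP recommendation, and the speed-up may well be negligible.

```shell
# Sketch only: helper names and the sizing rule are assumptions.

# Count data (non-header) lines in a plain-text VCF.
count_variants() {
    grep -vc '^#' "$1" || true   # grep exits 1 on a zero count; ignore that
}

# Print a buffer size so that roughly (variants / forks) go into each
# buffer, clamped to the range 1..5000 (5000 being the default above).
suggest_buffer_size() {
    n=$(count_variants "$1")
    forks=$2
    size=$(( (n + forks - 1) / forks ))   # ceiling division
    [ "$size" -lt 1 ] && size=1
    [ "$size" -gt 5000 ] && size=5000
    echo "$size"
}

# Example usage (not run here):
#   vep --cache --dir_cache vep-cache --offline --fasta ref-genome.fa \
#       --fork 4 --buffer_size "$(suggest_buffer_size somatic.filtered.snp.vcf 4)" \
#       -i somatic.filtered.snp.vcf -o snp_vep_out.txt
```

For a file with about 1000 variants and 4 forks this gives a buffer of about 250.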

ADD COMMENT • link 4.4 years ago by Emily 24k

0

Thank you for your answer, and sorry for only getting back to you now. Your explanation regarding forking makes a lot of sense!

I am writing again because I am now using VEP to annotate germline mutations, this time more than 4 million, and it has already taken more than a day with 4 forks.

My input VCF is sorted, but I am not sure if I have tabix-indexed my cache. I downloaded the cache using the installer script (homo_sapiens_vep_104_GRCh37.tar.gz). I saw on this site that there exists an already indexed cache.

Can you tell me if the cache that I downloaded via the script is already the indexed cache? Furthermore, I tried to manually download the cache with:

curl -O http://ftp.ensembl.org/pub/release-104/variation/indexed_vep_cache/homo_sapiens_vep_104_GRCh38.tar.gz

tar xzf homo_sapiens_vep_104_GRCh37.tar.gz

However, this didn't work and I got this error:

gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

I am currently redownloading the file and will try it again and report back.
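An "unexpected end of file" from gzip usually points to a truncated download. A small sketch for checking the archive before extracting it is shown here; the function name is hypothetical, while `gzip -t` and `curl -C -` are standard options. It is also worth checking that the file name passed to tar matches the one that curl fetched, since the commands above name GRCh38 in one place and GRCh37 in the other.

```shell
# Sketch only: the function name is hypothetical.

# Returns 0 only if the gzip stream is complete and passes its
# integrity check; a truncated download makes this fail.
archive_is_intact() {
    gzip -t "$1" 2>/dev/null
}

# Example usage (not run here; "$url" and "$archive" are placeholders):
#   curl -C - -O "$url"          # -C - resumes a partial download
#   if archive_is_intact "$archive"; then
#       tar xzf "$archive"
#   else
#       echo "archive incomplete, download again" >&2
#   fi
```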

Cheers!

EDIT:

I just tried out the convert_cache.pl script:

perl convert_cache.pl --dir . --species all --version all

which finished right away and reported "No unprocessed types remaining", so I guess my cache is already indexed. That really makes me wonder what I am doing wrong that makes VEP take so long.