Hi,
What memory usage and run time should I expect for VEP whole-genome variant annotation? I tried to annotate a 5-sample Illumina 30x coverage whole-genome VCF:
perl /ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl --force_overwrite -i G85829.vcf --cache --assembly GRCh37 --offline --individual all \
--symbol \
--numbers \
--biotype \
--total_length \
-o output \
--vcf \
--fields Consequence,Codons,Amino_acids,Gene,SYMBOL,Feature,EXON,Protein_position,BIOTYPE
This command ran out of memory after ~11 hours. There is about 20 GB of free memory on the Ubuntu server:
[==============================================================================] [ 100% ]
2015-03-05 22:14:40 - Processed 20675000 total variants (238 vars/sec, 547 vars/sec total)
2015-03-05 22:14:41 - Read 5000 variants into buffer
2015-03-05 22:14:41 - Reading transcript data from cache and/or database
[=====================================> ] [ 50% ]ERROR: Cannot allocate memory at /ensembl-tools-release-78/scripts/variant_effect_predictor/Bio/EnsEMBL/Variation/Utils/VEP.pm line 4735, <GEN0> line 4136138.
The number of processed variants is close to the total number of variants from GATK calling (each genome has about 3.5M SNPs and 0.5M indels; 20,917 total), so I wonder whether something happens after all variants have been processed. For comparison, ANNOVAR takes a few minutes to annotate one genome.
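As a rough sanity check on the timing, here is a small sketch that uses only the two figures printed in the last "Processed" log line. It assumes the cumulative rate (547 vars/sec total) held for the whole run; the variable names are just for illustration.

```shell
# Figures copied from the last "Processed" line of the log
processed=20675000   # "Processed 20675000 total variants"
rate=547             # "547 vars/sec total" (cumulative average)

# Elapsed time implied by the cumulative rate (integer arithmetic)
elapsed_s=$(( processed / rate ))
hours=$(( elapsed_s / 3600 ))
minutes=$(( (elapsed_s % 3600) / 60 ))

echo "implied run time at crash: ${hours}h ${minutes}m"
# prints: implied run time at crash: 10h 29m
```

That is consistent with the crash after ~11 hours, so the log line really is from the end of the run and not from an earlier stage.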
Vlad
Thanks, it worked! The memory footprint was greatly reduced, and it took about 10 hours to process 42M variants on 8 CPUs. I used
--fork 6
This can probably be optimized further via the batch size option.
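A minimal sketch of what such a run might look like, assuming the "batch size option" means VEP's --buffer_size (its default of 5000 matches the "Read 5000 variants into buffer" line in the log). The command is simply the one from the question with --fork 6 added, which may not be exactly what was run. The BUFFER value below is a hypothetical starting point, not a tested setting: larger buffers generally run faster but use more memory, smaller ones the reverse. The script only prints the command (dry run).

```shell
# Assumed tuning knobs; adjust to the cores and RAM available
FORKS=6        # value reported above
BUFFER=2500    # hypothetical; the VEP default is 5000

VEP=/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl
FIELDS=Consequence,Codons,Amino_acids,Gene,SYMBOL,Feature,EXON,Protein_position,BIOTYPE

# Same options as the command in the question, plus --fork and --buffer_size
CMD="perl $VEP --force_overwrite -i G85829.vcf --cache --assembly GRCh37 --offline --individual all"
CMD="$CMD --symbol --numbers --biotype --total_length"
CMD="$CMD --fork $FORKS --buffer_size $BUFFER"
CMD="$CMD -o output --vcf --fields $FIELDS"

# Dry run: show the command; paste it into the shell to actually run it
echo "$CMD"
```

Keeping FORKS below the number of available CPUs (6 of 8 here) leaves some headroom for the parent process and for I/O.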
Vlad