Installing ensembl 86 VEP and vcf2maf and getting SNP MAFs
7.4 years ago
eurioste ▴ 20

My final objective is to get the "Minor Allele Frequencies" (MAF) for all the 1000 Genomes SNPs (in H. sapiens GRCh37, in case you ask). I specifically need data from the low-coverage Phase 1 of the project, as I require unbiased low-coverage data for a machine learning model.

I have the 1000 Genomes VCF and I'm attempting to install both VEP 86 and vcf2maf to obtain the data I need. I want to install VEP 86 (instead of the current version, 89) because vcf2maf requires the archive version of VEP, and I don't know how to make it work with the latest VEP version.

As pointed out in this previous question (www.biostars.org/p/123822/), I'm following the instructions from this link to get vcf2maf installed: vcf2maf

which also points to these VEP installation instructions: VEP

I successfully installed Perl 5.22 in the path required by VEP, as described in the link below. This step is done. perl

I'm currently stuck at the following step of the VEP installation (again, see VEP):

Download and unpack VEP's offline cache for GRCh37, GRCh38, and GRCm38:

> rsync -zvh rsync://ftp.ensembl.org/ensembl/pub/release-86/variation/VEP/homo_sapiens_vep_86_GRCh{37,38}.tar.gz $VEP_DATA 
> rsync -zvh rsync://ftp.ensembl.org/ensembl/pub/release-86/variation/VEP/mus_musculus_vep_86_GRCm38.tar.gz $VEP_DATA 
> cat $VEP_DATA/*_vep_86_GRC{h37,h38,m38}.tar.gz | tar -izxf - -C $VEP_DATA

I know the path given in the instructions is wrong. When I try it the code runs but hangs forever:

ftp.ensembl.org/ensembl/pub/release-86/variation/VEP/homo_sapiens_vep_86_GRCh37.tar.gz

The correct current path is below. Notice that I'm only interested in human GRCh37:

ftp.ensembl.org/pub/release-86/variation/VEP/homo_sapiens_vep_86_GRCh37.tar.gz

When I attempt to correct the line I get:

> rsync -zvh rsync://ftp.ensembl.org/pub/release-86/variation/VEP/homo_sapiens_vep_86_GRCh37.tar.gz $VEP_DATA
@ERROR: Unknown module 'pub'
rsync error: error starting client-server protocol (code 5) at main.c(1653) [Receiver=3.1.1]

I don't know how to work around this problem. How can I fix this and follow the instructions correctly to get VEP and vcf2maf to work together?

1000Genomes VEP MAF vcf2maf Ensembl • 3.9k views
7.2 years ago

For your final objective, you should not use vcf2maf. The "MAF" in vcf2maf refers to "Mutation Annotation Format", which was something unnecessarily invented and confusingly named for cancer genetics. This possible confusion was already disambiguated in the post that you pointed to.

To reach your final objective, please use my answer to your previous post here - Getting 1000 Genomes phase one MAF values
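For readers who only need minor allele frequencies, here is a minimal sketch of one way to derive them directly from a VCF. It is not necessarily the approach in the linked answer. It assumes the INFO column (column 8) carries an `AF=` key holding the alternate allele frequency and that sites are biallelic; the function name `vcf_maf` is made up for illustration.

```shell
# Hypothetical helper: read an uncompressed VCF on stdin and print
# CHROM, POS, ID and the minor allele frequency, tab-separated.
# Assumption: INFO contains AF=<alt allele frequency>; for a
# multi-allelic site only the first listed value is used.
vcf_maf() {
  awk -F'\t' '
    /^#/ { next }                      # skip header lines
    {
      n = split($8, info, ";")         # INFO is the 8th column
      for (i = 1; i <= n; i++) {
        if (info[i] ~ /^AF=/) {        # matches AF= but not e.g. EUR_AF=
          af = substr(info[i], 4) + 0  # numeric value after "AF="
          maf = (af > 0.5) ? 1 - af : af
          printf "%s\t%s\t%s\t%g\n", $1, $2, $3, maf
        }
      }
    }'
}
```

For a compressed file, something like `zcat your_file.vcf.gz | vcf_maf > maf.tsv` would apply it, where the file names are placeholders. Sites without an `AF=` key are silently skipped.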

For the benefit of users that ran into your VEP installation issues:

  1. The path given in the instructions is for the rsync protocol, not for ftp. Read more about this at this link.
  2. The rsync step is supposed to be slow, and it will take a long time. The VEP caches for GRCh37 and GRCh38 are almost 5GB each, and Ensembl's servers can be slow. The advantage of using rsync is that it can resume partial downloads that were aborted by impatient users.
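The two points above can be sketched as a small helper. In an `rsync://host/module/path` URL, the first path component is a module name on the rsync server, which is why dropping `ensembl` produces `Unknown module 'pub'`. The function names below are hypothetical; the URL layout is the one from the vcf2maf instructions, and `--partial` and `--progress` are standard rsync options added here as an assumption about what is helpful, not something the instructions require.

```shell
# Build the rsync:// URL for one VEP cache tarball.
# Arguments: release, species, assembly (e.g. 86 homo_sapiens GRCh37).
# "ensembl" is the rsync module; "pub/..." is the path inside it.
vep_cache_url() {
  printf 'rsync://ftp.ensembl.org/ensembl/pub/release-%s/variation/VEP/%s_vep_%s_%s.tar.gz\n' \
    "$1" "$2" "$1" "$3"
}

# Download one cache into the directory given as the 4th argument.
# --partial keeps an interrupted file so a rerun can reuse it;
# --progress shows that the transfer is moving rather than hung.
fetch_vep_cache() {
  rsync -zvh --partial --progress "$(vep_cache_url "$1" "$2" "$3")" "$4"
}
```

With these defined, `fetch_vep_cache 86 homo_sapiens GRCh37 "$VEP_DATA"` would fetch only the human GRCh37 cache, and the unpack step from the instructions then applies to that single tarball. Running `rsync rsync://ftp.ensembl.org/` with no module lists the modules the server offers.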

The "MAF" in vcf2maf refers to "Mutation Annotation Format", which was something unnecessarily invented and confusingly named for cancer genetics.

So true! This also conflicts with Multiple Alignment Format.


And Minor Allele Frequency!

