Update 12/8/2020: Uniform realignment on GRCh38, somatic variant calling, and effect prediction is now done by NCI's GDC in MAFs downloadable from https://portal.gdc.cancer.gov. Use the instructions below only if you want to use the GRCh37 MAFs that most closely match what was published in the first batch of marker papers.
Goal:
Input: MAF files on TCGA's Firehose, that list somatic variants annotated using Broad's Oncotator
Output: MAF files re-annotated using Ensembl's VEP on GRCh37 transcripts in Gencode v19, using maf2maf
Fetch MAFs from Firehose
Firehose has a nice UI at firebrowse.org, but we'll use their command-line tool instead. They automatically select a "best MAF" per tumor type at this page, that ends up in a tarball named Mutation_Packager_Oncotated_Raw_Calls
in Firehose. But some cancer types only have calls in tarballs named Mutation_Packager_Oncotated_Calls
, where the selection of the "best MAF" was non trivial.
Fetch the firehose_get
tool, and start downloading Mutation_Packager_Oncotated_Raw_Calls
tarballs:
wget -P bin http://gdac.broadinstitute.org/runs/code/firehose_get_latest.zip
unzip bin/firehose_get_latest.zip -d bin
./bin/firehose_get -b -only Mutation_Packager_Oncotated_Raw_Calls data latest
That creates a folder structure with tarballs in separate tumor-type subfolders. Find the tumor types with empty subfolders, and download the Mutation_Packager_Oncotated_Calls
for those:
find stddata__* -type d -empty | cut -f2 -d/ | xargs ./bin/firehose_get -b -only Mutation_Packager_Oncotated_Calls data latest
Unpack all the tarballs, and rename the resulting subfolders to just the tumor type codes:
mkdir mafs
cat stddata__*/*/*/*Level_3*.tar.gz | tar -izxf - -C mafs
ls -d mafs/* | perl -ne 'chomp; ($t)=m/gdac.broadinstitute.org_(\w+)/; print "mv $_ mafs/$t\n"' | bash
Let's toss the COAD
and READ
MAFs because these samples are already accounted for under COADREAD
. In retrospect, TCGA should have given 1 tumor type code (e.g. CRC
) for these molecularly identical tumor types:
rm -rf mafs/{COAD,READ}
Conversely, let's toss the combined cohorts for GBM+LGG, Pan-Kidney, Pan-GI. We'd rather deal with the same samples in their individual tumor types:
rm -rf mafs/{GBMLGG,KIPAN,PANGI}
Set up VEP and maf2maf
Follow these instructions to set up Ensembl's VEP v86 with a local transcript and annotation cache for GRCh37. In this tutorial, I'll assume that both VEP and its cache are set up at /opt/common/CentOS_6-dev/vep/v86
.
Download and unpack vcf2maf v1.6.11, which includes the maf2maf
tool that we'll be using:
wget -P bin https://github.com/mskcc/vcf2maf/archive/v1.6.11.tar.gz
tar -zxf bin/v1.6.11.tar.gz
Run maf2maf and merge tumor-type MAFs
For each per-sample MAF, we'll run maf2maf with the following tweaks:
- Use arguments
--vep-path
and--vep-data
to point to /opt/common/CentOS_6-dev/vep/v86 - Simplify the names of the resulting MAFs in the format
sample_barcode.vep.maf
- For some genes, map variants to MSKCC's canonical isoform, rather than what VEP picks
- Retain some useful Oncotator columns from the input MAF like sequence context and dbNSFP scores
Use some in-line perl to generate maf2maf
commands for each per-sample MAF, and either submit each of those as parallel jobs on a cluster, or pipe them into bash
for sequential execution like shown below:
ls mafs/*/*.maf.txt | perl -ne 'chomp; ($s)=m/^(.*TCGA-\w\w-\w{4}-\d\d)/; print "perl vcf2maf-1.6.11/maf2maf.pl --input-maf $_ --output-maf $s.vep.maf --custom-enst vcf2maf-1.6.11/data/isoform_overrides_at_mskcc --vep-path /opt/common/CentOS_6-dev/vep/v86 --vep-data /opt/common/CentOS_6-dev/vep/v86 --ref-fasta /opt/common/CentOS_6-dev/vep/v86/homo_sapiens/86_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz --retain-cols Center,Verification_Status,Validation_Status,Mutation_Status,Sequencing_Phase,Sequence_Source,Validation_Method,Score,BAM_file,Sequencer,Tumor_Sample_UUID,Matched_Norm_Sample_UUID,ref_context,i_dbNSFP_MutationAssessor_pred,i_dbNSFP_MutationAssessor_rankscore,i_dbNSFP_MutationAssessor_score,i_dbNSFP_MutationTaster_converted_rankscore,i_dbNSFP_MutationTaster_pred,i_dbNSFP_MutationTaster_score\n" unless(-s "$s.vep.maf")' | bash
Hi Cyriac,
Thank you for the excellent tutorial(s). I was following the VEP build tutorial but somehow keep running into an issue when trying to install the Ensembl API and reference FASTAs (Something to do with my perl build - will figure that another day)
Anyway I figured for the purpose of the tutorial above, I only need the reference fasta file really. So I downloaded that directly from http://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz which is where INSTALL.pl is also sourcing from. However, when trying to run the maf2maf.pl, I run into an error like -
Most obvious guess seems that the chromosome naming is not the same in the fasta and the maf files ( chr1 vs 1). Do I need to change that first in the maf files ? Or am I doing something incorrect?
2nd question -- Why should we prefer to use v84 VEP instead of v75 specially if we are working with Grch37 instead of Grch38? I initially was trying to use the release 75 of Ensembl VEP (just because all my work is using release 75 files). I amended the scripts similarly, however the v75 INSTALL.pl script does not support plugin download. Seems like the script has been heavily amended between release 75 and 84. I then proceeded to use v84 only to realise later that at the end of the day its still downloading the v75 fasta file and we wouldn't really need the plugin for this tutorial. Am I correct ?
Thanks for your time and help!
You'll need to unpack the gzipped FASTA file with
Ensembl annotation on GRCh37 was frozen at release 75; while newer software releases still support using GRCh37, most data is frozen at 75, hence why the v75 FASTA gets downloaded.