We recommend using Ensembl's neatly organized and properly versioned resources. For an entire project, make a decision on which Ensembl release you want to work with, and then stick with that. This will ensure a standard reference sequence, standard gene names, standard transcript/isoform definitions, standard COSMIC/OMIM/dbSNP/ENCODE datasets, etc.
If you're working with GRCh37 (aka hg19 or Build37), then use the latest Ensembl release 73. Here's some commands to grab the reference FASTA from their servers, and index it with SAMtools:
curl -LO ftp://ftp.ensembl.org/pub/release-73/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.73.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh37.73.dna.primary_assembly.fa.gz
samtools faidx Homo_sapiens.GRCh37.73.dna.primary_assembly.fa
Here's a bonus tip on creating an --roi-file
for MuSiC based on Ensembl 73 isoforms. Start with the GTF transcript annotation file:
curl -LO ftp://ftp.ensembl.org/pub/release-73/gtf/homo_sapiens/Homo_sapiens.GRCh37.73.gtf.gz
gunzip Homo_sapiens.GRCh37.73.gtf.gz
Make a BED file listing CDS/ncRNA loci for each gene:
perl -ne 'chomp; @c=split(/\t/); $c[3]--; $c[8]=~s/.*gene_name\s\"([^"]+)\".*/$1/; print join("\t",@c[0,3,4,8,5,6])."\n" if($c[2] eq "CDS" or $c[2] eq "exon")' Homo_sapiens.GRCh37.73.gtf > Homo_sapiens.GRCh37.73.all_exon_loci.bed
Create another BED file that merges overlapping exons on the same strand (you'll need BEDtools' mergeBed):
mergeBed -s -nms -i Homo_sapiens.GRCh37.73.all_exon_loci.bed | perl -pe 's/;.*//' | cut -f 1-4 > Homo_sapiens.GRCh37.73.merged_exons.bed
Create an --roi-file
for use with MuSiC, which adds 2bp flanks (for splice junctions) to each exon, and uses 1-based start and stop loci:
perl -ane '$F[1]--; $F[2]+=2; print join("\t",@F[0..3])."\n";' Homo_sapiens.GRCh37.73.merged_exons.bed > ensembl73_roi_file_for_music
This blind addition of 2bp flanks was reported to take at least 1 ROI outside the bounds of its sequence (an exon of AL603926.1 on GL000223.1). So here is a slightly complicated, but safer script, to create an --roi-file
for MuSiC. It loads the FASTA sequence lengths into a hash called %chrEnd
, and checks each ROI against this, after adding 2bp flanks:
perl -e '%chrEnd=map{chomp; split(/\t/)}`cut -f 1,2 Homo_sapiens.GRCh37.73.dna.primary_assembly.fa.fai`; map{($c,$s,$e,$g)=split(/\t/); $s--; $e+=2; $s=1 if($s<1); $e=$chrEnd{$c} if($chrEnd{$c} and $e>$chrEnd{$c}); print "$c\t$s\t$e\t$g"}`cat Homo_sapiens.GRCh37.73.merged_exons.bed`' > ensembl73_roi_file_for_music
Update: human_g1k_v37
and GRCh37
have chromosomes/contigs of equal length, including MT. So genomic loci for either reference should refer to the same thing. The nucleic acid sequence is also likely the same, because all the pending patches to GRCh37
will only be made official when GRCh38
is released. If someone had the patience to a do a diff
between the FASTAs, then please let us know what you find!
I think Broad uses the "chr" prefix on all of their chromosomes, so that alone will screw up matching if your ROIs are in the format you provided.
This is true. You need to ensure that you use the same chromosome naming conventions in your BAM file, FASTA, MAF file, and ROI file.