Best Reference Sequence For Music And "Bit_Test" Error
1
1
Entering edit mode
11.3 years ago
DoubleD ▴ 130

Hello all,

We were originally using human_g1k_v37.fasta for our reference sequence, but after some reading, I was looking at the Broad variant called "Homo_sapiens_assembly19.fasta" or the complete reference called "hg19.fa" . As far as working with MuSiC, is there a best or better pick? We have had many errors with the broad variant, and using the hg19.fa gives a lot of "Skipping invalid ROI bitmask 1:22181339-22181478" errors. Seems like g1k is the way to go? If we use one reference for the VCF processing and other for MAF creation, since they are all hg19 spec does it matter if they're different? Obviously consistence all the way through is best, but is there a big difference?

Second question, when running BMR we're getting a process ending error after joinx-ing

Parsing MAF file to classify mutations
ERROR: Can't call method "bit_test" on an undefined value at /usr/local/share/perl/5.14.2/Genome/Model/Tools/Music/Bmr/CalcBmr.pm line 575, <gen19> line 16602.

The 'undefined value' error isn't that specific, so I was hoping that someone else has had something similar.

Thank you

music reference • 5.3k views
ADD COMMENT
1
Entering edit mode

I think Broad uses the "chr" prefix on all of their chromosomes, so that alone will screw up matching if your ROIs are in the format you provided.

ADD REPLY
0
Entering edit mode

This is true. You need to ensure that you use the same chromosome naming conventions in your BAM file, FASTA, MAF file, and ROI file.

ADD REPLY
3
Entering edit mode
11.3 years ago

We recommend using Ensembl's neatly organized and properly versioned resources. For an entire project, make a decision on which Ensembl release you want to work with, and then stick with that. This will ensure a standard reference sequence, standard gene names, standard transcript/isoform definitions, standard COSMIC/OMIM/dbSNP/ENCODE datasets, etc.

If you're working with GRCh37 (aka hg19 or Build37), then use the latest Ensembl release 73. Here's some commands to grab the reference FASTA from their servers, and index it with SAMtools:

curl -LO ftp://ftp.ensembl.org/pub/release-73/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.73.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh37.73.dna.primary_assembly.fa.gz
samtools faidx Homo_sapiens.GRCh37.73.dna.primary_assembly.fa

Here's a bonus tip on creating an --roi-file for MuSiC based on Ensembl 73 isoforms. Start with the GTF transcript annotation file:

curl -LO ftp://ftp.ensembl.org/pub/release-73/gtf/homo_sapiens/Homo_sapiens.GRCh37.73.gtf.gz
gunzip Homo_sapiens.GRCh37.73.gtf.gz

Make a BED file listing CDS/ncRNA loci for each gene:

perl -ne 'chomp; @c=split(/\t/); $c[3]--; $c[8]=~s/.*gene_name\s\"([^"]+)\".*/$1/; print join("\t",@c[0,3,4,8,5,6])."\n" if($c[2] eq "CDS" or $c[2] eq "exon")' Homo_sapiens.GRCh37.73.gtf > Homo_sapiens.GRCh37.73.all_exon_loci.bed

Create another BED file that merges overlapping exons on the same strand (you'll need BEDtools' mergeBed):

mergeBed -s -nms -i Homo_sapiens.GRCh37.73.all_exon_loci.bed | perl -pe 's/;.*//' | cut -f 1-4 > Homo_sapiens.GRCh37.73.merged_exons.bed

Create an --roi-file for use with MuSiC, which adds 2bp flanks (for splice junctions) to each exon, and uses 1-based start and stop loci:

perl -ane '$F[1]--; $F[2]+=2; print join("\t",@F[0..3])."\n";' Homo_sapiens.GRCh37.73.merged_exons.bed > ensembl73_roi_file_for_music

This blind addition of 2bp flanks was reported to take at least 1 ROI outside the bounds of its sequence (an exon of AL603926.1 on GL000223.1). So here is a slightly complicated, but safer script, to create an --roi-file for MuSiC. It loads the FASTA sequence lengths into a hash called %chrEnd, and checks each ROI against this, after adding 2bp flanks:

perl -e '%chrEnd=map{chomp; split(/\t/)}`cut -f 1,2 Homo_sapiens.GRCh37.73.dna.primary_assembly.fa.fai`; map{($c,$s,$e,$g)=split(/\t/); $s--; $e+=2; $s=1 if($s<1); $e=$chrEnd{$c} if($chrEnd{$c} and $e>$chrEnd{$c}); print "$c\t$s\t$e\t$g"}`cat Homo_sapiens.GRCh37.73.merged_exons.bed`' > ensembl73_roi_file_for_music

Update: human_g1k_v37 and GRCh37 have chromosomes/contigs of equal length, including MT. So genomic loci for either reference should refer to the same thing. The nucleic acid sequence is also likely the same, because all the pending patches to GRCh37 will only be made official when GRCh38 is released. If someone had the patience to a do a diff between the FASTAs, then please let us know what you find!

ADD COMMENT
0
Entering edit mode

This is very informative and helpful, thank you Cyriac. If Somatic Sniper (and Mutect) were run using a different reference (the 1k genomes v37) then is it still possible to use the GRC reference afterwards? With v38 coming out I'm wondering if it's worth re-running mutect and sniper withe GCRh37.73 reference...

Regarding the "bit_test" error, I removed the Missense mutations from the MAF and it runs fine. We haven't run into that before, and of course would like to keep our Missense mutations...

ADD REPLY
0
Entering edit mode

No need to rerun it. human_g1k_v37.fasta is based on GRCh37. For the 1000 genomes project, they just downloaded the latest GRCh37 FASTA from Ensembl (as of Oct 10, 2009), updated MT to use the sequence at NC_012920, and renamed it to human_g1k_v37.fasta. Read this and this.

ADD REPLY
0
Entering edit mode

I went through all of this. Since our BAMs were aligned with that human_g1k_v37.fasta I think we should stick with it all the way through. My concern is that I can't find a GTF file for the g1k reference to try and make a ROI file; I've been using the provided "ensembl_67_cds_ncrna_and_splice_sites_hg19" file. Can I trust that I'm actually looking at the right region?

I did download the sequence you suggested above and made a ROI file (which I like, I'd like to use something from the ground up as you suggested) but this "Homo_sapiens.GRCh37.73.dna.primary_assembly.fa" isn't in karyotypic order; All the other references I have seen are 1,2,3... but this is 1,10,11,12... and I tried to use picard to re-order it, to no avail. Most of the posts I can find talk about resorting the BAM with picard, but isn't it easier to re-order the reference?

ADD REPLY
0
Entering edit mode

If you compare the .fai index files of human_g1k_v37.fasta and Homo_sapiens.GRCh37.73.dna.primary_assembly.fa, you'll find that the chromosome lengths are identical. So regions specified in Homo_sapiens.GRCh37.73.gtf should be the same loci on both FASTA files.

The ordering of chromosomes in a FASTA should never be an issue, unless someone wrote some crappy code that makes assumptions about that ordering. When people talk about re-sorting a BAM with picard, it's usually about ordering each read by its genomic locus within a chromosome. Having reads sorted by locus allows programs to perform better. Re-ordering the reference won't fix their woes. The reads in a BAM are (usually) already aligned to a reference.

ADD REPLY
0
Entering edit mode

I have seen that the autosomes are the same, and just the scaffolding and unmapped regions at the end of the file differ. However, as I posted in a new post this morning, I keep getting an error ERROR: Bit::Vector::Interval_Fill(): maximum index out of range at /usr/local/share/perl/5.14.2/Genome/Model/Tools/Music/Bmr/CalcBmr.pm line 232, <gen4> line 301403. which makes me think that something isn't right in my defines regions/addresses

ADD REPLY
0
Entering edit mode

Take a look at this part of MuSiC's code where bit_test is used. Please confirm that the MAF, reference fasta, BAM files, and ROI file, all use the same chromosome naming conventions.

ADD REPLY

Login before adding your answer.

Traffic: 1454 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6