Question

Tophat2 Error when moving from UCSC to NONCODE annotation

0

9.1 years ago

tunl ▴ 90

When I ran tophat2 with a UCSC GTF file and FASTA file (genome.fa), it completed without any problems.

However, when I used a NONCODE mouse mm10 gtf file and NONCODE mouse fa file downloaded from http://www.noncode.org/download.php, I got “Error: Couldn't build bowtie index with err = 1” as follows:

 [2016-06-30 00:49:29] Beginning TopHat run (v2.0.10)
 -----------------------------------------------
 [2016-06-30 00:49:29] Checking for Bowtie
                   Bowtie version:        2.1.0.0

 [2016-06-30 00:49:29] Checking for Samtools
                 Samtools version:        0.1.18.0

 [2016-06-30 00:49:29] Checking for Bowtie index files (genome)..

 [2016-06-30 00:49:29] Checking for reference FASTA file

 [2016-06-30 00:49:29] Generating SAM header for Bowtie2Index/genome

 [2016-06-30 00:49:32] Reading known junctions from GTF file

 [2016-06-30 00:49:35] Preparing reads
          left reads: min. length=50, max. length=50, 29800718 kept reads (16597 discarded)

 [2016-06-30 00:54:48] Building transcriptome data files ./result/346_mm10/tmp/NONCODE2016_mouse_mm10_lncRNA

 [2016-06-30 00:54:51] Building Bowtie index from NONCODE2016_mouse_mm10_lncRNA.fa
         [FAILED]

 Error: Couldn't build bowtie index with err = 1

I renamed the NONCODE mouse fa file as genome.fa in the Bowtie2Index/ directory and kept the other 7 existing files (all have base name “genome”: genome.1.bt2, genome.3.bt2, genome.rev.1.bt2, genome.2.bt2, genome.4.bt2, genome.fa.fai, genome.rev.2.bt2) from the previous UCSC Bowtie2Index/ directory. I am wondering if this may cause any problems?

I also tried (in the Bowtie2Index/ directory):

 bowtie-inspect -n genome

and got the following error message:

 Could not locate a Bowtie index corresponding to basename "genome"

I also tried this command in the original UCSC Bowtie2Index/ directory and got the same error message, even though tophat2 runs fine with the UCSC data.
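
One possible explanation, offered as an assumption since nothing in this thread confirms it: bowtie-inspect belongs to Bowtie 1, and older Bowtie 1 releases look for .ebwt index files, whereas a Bowtie 2 index consists of .bt2 files and is read with bowtie2-inspect. A minimal sketch of a check for which kind of index a basename points to (index_flavor is a made-up helper name, not part of either tool):

```shell
# Report which kind of Bowtie index exists for a given basename.
# index_flavor is a hypothetical helper, not part of Bowtie or Bowtie 2.
index_flavor() {
    base=$1
    if [ -e "$base.1.bt2" ] || [ -e "$base.1.bt2l" ]; then
        echo "bowtie2"      # inspect with: bowtie2-inspect -n "$base"
    elif [ -e "$base.1.ebwt" ] || [ -e "$base.1.ebwtl" ]; then
        echo "bowtie1"      # inspect with: bowtie-inspect -n "$base"
    else
        echo "none"
    fi
}
```

It could be called as index_flavor Bowtie2Index/genome. For a directory that holds genome.1.bt2 and its sibling files, it reports bowtie2.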

I would really appreciate any solution for this problem.

Thank you very much!

tophat RNA-seq • 2.5k views

ADD COMMENT • link updated 9.1 years ago by GenoMax 152k • written 9.1 years ago by tunl ▴ 90

score 1 · Answer 1 · 2016-07-01

1

9.1 years ago

GenoMax 152k

“I renamed the NONCODE mouse fa file as genome.fa in the Bowtie2Index/ directory and kept the other 7 existing files”

You can't do things like this and still expect the program to work. The file names are important, but the contents matter more. The NONCODE genome file you copied in certainly has a different fasta header (and perhaps different contents) than the original UCSC file, so the existing index files, which were built from the UCSC genome, no longer correspond to it.

The proper procedure here would be to create a new set of indexes from the NONCODE file with bowtie2-build, with no shortcuts, and then proceed. The genome file does not have to be called genome.fa; the name can be anything.
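
A minimal sketch of that procedure, assuming bowtie2-build is on the PATH. The directory name noncode_index and the helper name build_fresh_index are placeholders chosen for illustration; the only Bowtie 2 call used is bowtie2-build <reference FASTA> <index basename>.

```shell
# Build a Bowtie 2 index in its own directory so that no files from an
# older index can be mixed in. build_fresh_index is a hypothetical helper.
build_fresh_index() {
    fasta=$1      # reference FASTA, e.g. NONCODE2016_mouse.fa
    outdir=$2     # new directory for the index, e.g. noncode_index
    base=$3       # index basename, e.g. noncode

    mkdir -p "$outdir" || return 1
    # Refuse to continue if index files with this basename already exist.
    if ls "$outdir/$base".*.bt2* >/dev/null 2>&1; then
        echo "index files for '$base' already exist in $outdir" >&2
        return 1
    fi
    bowtie2-build "$fasta" "$outdir/$base"
}
```

For example: build_fresh_index NONCODE2016_mouse.fa noncode_index noncode. The resulting basename to hand to other tools would then be noncode_index/noncode.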

ADD COMMENT • link 9.1 years ago by GenoMax 152k

0

Thank you so much for your advice! I am now trying to create a set of new index files with the NONCODE file using bowtie2-build as follows:

bowtie2-build -f genome.fa genome

Where genome.fa is the NONCODE2016_mouse.fa I downloaded from http://www.noncode.org/download.php. I was able to successfully create six index files (genome.2.bt2, genome.4.bt2, genome.rev.1.bt2, genome.1.bt2, genome.3.bt2, genome.rev.2.bt2). However, when I rerun tophat2 with these new index files, I still get: “Error: Couldn't build bowtie2 index with err = 1” at the "Building Bowtie index" step.

When I used:

bowtie2-inspect -n genome

the output I got was very large, and the following is a small portion:

NONMMUT000001.2
NONMMUT000009.2
NONMMUT000015.2
NONMMUT000018.2

These names don't seem to match the first column of the NONCODE mouse mm10 gtf file; they seem to match the transcript_id values instead. The following is a small portion of my gtf file:

chr1    Cufflinks   transcript  3063334 3064403 0   +   .   gene_id "NONMMUG000001.1"; transcript_id "NONMMUT000001.1"; FPKM "0"; exon_number 1; 
chr1    Cufflinks   exon    3063334 3064403 0   +   .   gene_id "NONMMUG000001.1"; transcript_id "NONMMUT000001.1"; FPKM "0"; exon_number 1; 
chr1    Cufflinks   transcript  3456668 3503634 0   +   .   gene_id "NONMMUG000008.1"; transcript_id "NONMMUT000009.1"; FPKM "0"; exon_number 2;
chr1    Cufflinks   exon    3456668 3456768 0   +   .   gene_id "NONMMUG000008.1"; transcript_id "NONMMUT000009.1"; FPKM "0"; exon_number 2; 
chr1    Cufflinks   exon    3503486 3503634 0   +   .   gene_id "NONMMUG000008.1"; transcript_id "NONMMUT000009.1"; FPKM "0"; exon_number 2;   
chr1    Cufflinks   transcript  3670236 3671869 0   +   .   gene_id "NONMMUG000012.1"; transcript_id "NONMMUT000015.1"; FPKM "0"; exon_number 1;  
chr1    Cufflinks   exon    3670236 3671869 0   +   .   gene_id "NONMMUG000012.1"; transcript_id "NONMMUT000015.1"; FPKM "0"; exon_number 1;
chr1    Cufflinks   transcript  3869653 3898640 0   +   .   gene_id "NONMMUG000014.1"; transcript_id "NONMMUT000018.1"; FPKM "0"; exon_number 2;
chr1    Cufflinks   exon    3869653 3869781 0   +   .   gene_id "NONMMUG000014.1"; transcript_id "NONMMUT000018.1"; FPKM "0"; exon_number 2;

I am wondering if I produced the bowtie2 index files correctly? Do I need to add some command line options to bowtie2-build, or do I need to do some pre-processing with the original fasta file? I would really appreciate your help.

Thank you very much!

ADD REPLY • link 9.1 years ago by tunl ▴ 90

1

First of all, are you doing these steps in a new directory, starting with just the NONCODE genome file in there? It is a bad idea to mix datasets.

If I am looking at the right files, then NONCODE2016_mouse.fa contains only the non-coding sequences, whereas NONCODE2016_mouse_mm10_lncRNA.gtf contains annotation data that references the entire genome (chr1 and so on). So these two are not going to work together.
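
A minimal sketch of a check for this kind of mismatch: it lists the sequence names used in the first column of a GTF file that have no matching header in a FASTA file. It uses only standard tools (grep, sed, awk, sort, comm); the helper name missing_seqnames is made up for illustration.

```shell
# Print GTF sequence names (column 1) that are absent from the FASTA headers.
# missing_seqnames is a hypothetical helper name.
missing_seqnames() {
    gtf=$1
    fasta=$2
    fa_names=$(mktemp) || return 1
    gtf_names=$(mktemp) || return 1

    # FASTA names: header line without '>' and without anything after the
    # first whitespace.
    grep '^>' "$fasta" | sed 's/^>//; s/[[:space:]].*$//' | LC_ALL=C sort -u > "$fa_names"
    # GTF names: first column of every non-comment, non-empty line.
    awk '!/^#/ && NF { print $1 }' "$gtf" | LC_ALL=C sort -u > "$gtf_names"

    # comm -23 keeps lines that appear only in the first (GTF) list.
    LC_ALL=C comm -23 "$gtf_names" "$fa_names"
    rm -f "$fa_names" "$gtf_names"
}
```

For example: missing_seqnames NONCODE2016_mouse_mm10_lncRNA.gtf NONCODE2016_mouse.fa. An empty result means every sequence name in the GTF has a matching FASTA record; any chromosome names printed are ones the FASTA does not contain.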

I think you are going to need to make a transcriptome-specific TopHat index from just the GTF file and the mm10 genome. There is a section in the TopHat manual about that (look for the -G option). There are threads on Biostars about this as well.
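
A minimal sketch of that transcriptome-index build, following the form shown in the TopHat manual (tophat -G <GTF> --transcriptome-index=<prefix> <genome index basename>). It assumes tophat2 is on the PATH and that a Bowtie 2 index of the full mm10 genome already exists; the path mm10_index/genome and the helper name build_transcriptome_index are placeholders, not something given in this thread.

```shell
# Build a TopHat transcriptome index from a GTF plus a whole-genome index.
# build_transcriptome_index is a hypothetical helper name.
build_transcriptome_index() {
    gtf=$1           # annotation, e.g. NONCODE2016_mouse_mm10_lncRNA.gtf
    genome_base=$2   # basename of a Bowtie 2 index of the full mm10 genome
    out_prefix=$3    # where TopHat writes the index, e.g. transcriptome_data/known

    # The genome index must exist before TopHat can extract transcripts from it.
    if [ ! -e "$genome_base.1.bt2" ] && [ ! -e "$genome_base.1.bt2l" ]; then
        echo "no Bowtie 2 index found for basename '$genome_base'" >&2
        return 1
    fi
    # With -G and --transcriptome-index but no read files, TopHat only builds
    # the transcriptome files and then exits.
    tophat2 -G "$gtf" --transcriptome-index="$out_prefix" "$genome_base"
}
```

For example: build_transcriptome_index NONCODE2016_mouse_mm10_lncRNA.gtf mm10_index/genome transcriptome_data/known. The important difference from indexing the NONCODE FASTA is that the genome basename points at whole chromosomes, which is what the GTF coordinates refer to.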

ADD REPLY • link 9.1 years ago by GenoMax 152k

0

Thank you very much for the suggestion!

I created the bowtie2 index files in a new directory containing only the NONCODE genome file, so the datasets are kept separate.

As you suggested, I tried to make a transcriptome-specific TopHat index using the following command (based on the TopHat manual):

./tophat2 -p 8 -G genes.gtf --transcriptome-index=transcriptome_data/known --b2-very-sensitive --library-type fr-firststrand Bowtie2Index/genome

Where genes.gtf is NONCODE2016_mouse_mm10_lncRNA.gtf and Bowtie2Index/genome is the base name of the bowtie2 index files created from NONCODE2016_mouse.fa. However, I still get the following error when I run the above command:

[2016-07-03 00:16:20] Building transcriptome files with TopHat v2.0.10
-----------------------------------------------
[2016-07-03 00:16:20] Checking for Bowtie
                  Bowtie version:        2.1.0.0
[2016-07-03 00:16:20] Checking for Samtools
                Samtools version:        0.1.18.0
[2016-07-03 00:16:21] Checking for Bowtie index files (genome)..
[2016-07-03 00:16:21] Checking for reference FASTA file
[2016-07-03 00:16:21] Building transcriptome data files transcriptome_data/known
[2016-07-03 00:16:24] Building Bowtie index from known.fa
        [FAILED]
Error: Couldn't build bowtie index with err = 1

Since you mentioned that NONCODE2016_mouse.fa only contains the non-coding portion, I'm wondering whether the bowtie2 index files created from NONCODE2016_mouse.fa are sufficient for making a transcriptome-specific TopHat index?

What also worries me is that the output of the "bowtie2-inspect -n genome" command listed only transcript IDs rather than chromosomes. I'm not sure whether this is a problem?

Thank you very much for your help!

ADD REPLY • link 9.1 years ago by tunl ▴ 90