Question

Building Bowtie index failure

0

Entering edit mode

8.9 years ago

Cricket ▴ 10

Greetings: I am using Tophat2 (command line) to analyze RNA-seq data and I am encountering some errors.

Here is the call:

tophat2 \
  -o tophat2_results/ \
  -G ref_data/BA000007.2.gtf \
  --transcriptome-index=transcriptome_data/RNA_LBG01b_241_filteredQ indices/BA000007.2 \
  data_files/RNA_LBG01b_241_filteredQ.fastq

Here is the error:

[2015-12-29 12:58:33] Checking for Bowtie
          Bowtie version:     2.2.4.0
[2015-12-29 12:58:33] Checking for Bowtie index files (genome)..
[2015-12-29 12:58:33] Checking for reference FASTA file
[2015-12-29 12:58:33] Generating SAM header for indices/BA000007.2
[2015-12-29 12:58:33] Reading known junctions from GTF file
    Warning: TopHat did not find any junctions in GTF file
[2015-12-29 12:58:33] Preparing reads
     left reads: min. length=12, max. length=342, 202732 kept reads (1315 discarded)
Warning: short reads (<20bp) will make TopHat quite slow and take large amount of memory because they are likely to be mapped in too many places
[2015-12-29 12:58:39] Building transcriptome data files transcriptome_data/RNA_LBG01b_241_filteredQ
[2015-12-29 12:58:40] Building Bowtie index from RNA_LBG01b_241_filteredQ.fa
    [FAILED]
Error: Couldn't build bowtie index with err = 1

Version Information:

TopHat v2.1.0        Bowtie2 version 2.2.4              Python 2.7.10 :: Anaconda 2.4.0 (64-bit)

System Information:

CentOS Release 6.7

How I got here and what have I tried:

I am using E. coli (Accession: BA000007.2) for my reference genome which can be found here: http://www.ncbi.nlm.nih.gov/nuccore/BA000007.2

I obtained my GTF file from Ensembl (ftp://ftp.ensemblgenomes.org/pub/release-29/bacteria//gtf/bacteria_9_collection/escherichia_coli_o157_h7_str_sakai/)

I made my indices using bowtie2-build (before tophat2 call)

bowtie2-build -f ref_data/BA000007.2.fasta indices/BA000007.2

I am aware that the error I am receiving is affiliated with different names appearing in the first column in the *.gtf file and the name of the reference fasta file. If I understand this correctly, every entry in the 1st column should be BA000007.2 where most of the names in the 1st column where "Chromosome". To fix this, I did the following:

awk '{FS=OFS="\t"}{print "BA000007.2", $2, $3, $4, $5, $6, $7, $8, $9}' pathToGTF/BA000007.2_ensemble.gtf > pathToGTF/BA000007.2.gtf

Please note the commented build information (e.g., #!genome-build ASM80120v1) at the beginning of Ensembl gtf file would create undesirable output from the awk command has been addressed

I also changed the termination of the fasta file from *.fasta to *.fa

And to make sure that bowtie can access the fasta reference file, the reference file is not only in the directory with the gtf file, but ALL immediate subdirectories!

Questions:

Did I properly put the kibosh on any problems arising from differences in naming between the 1st column of the gtf file and the name of the fasta file (BA000007.2, BA000007.2.fa)?
When I peruse output in the logs directory, there are several errors (g2f.err & similar errors in ftf_juncs.log) with lines beginning with:
```
Warning: invalid start coordinate at line: 
BA000007.2    ena    gene    -194    2502    .    +    .    gene_id "BAA31757"; gene_version "1"; gene_name "tagA"; gene_source "ena"; gene_biotype "protein_coding";
```
There are indeed negative numbers in the gtf files, but not in the genbank file (quick search in vim). Could this be the source of the error? I commented out the specific lines and then deleted them from the file -- both approaches still result in the error.
Is there anything readily seen that could be causing the Couldn't build bowtie index with err = 1 error?

I have been stuck on this for a couple of days so any help is greatly appreciated.

RNA-Seq tophat2 bowtie2 • 6.2k views

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 8.9 years ago by Cricket ▴ 10

0

Entering edit mode

Hi;

have you tried

bowtie2-build -f ref_data/BA000007.2.fasta ref_data/BA000007.2

instead of your previous command line

bowtie2-build -f ref_data/BA000007.2.fasta indices/BA000007.2

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.9 years ago by Mehmet ▴ 820

0

Entering edit mode

Since the files would be the same, I copied the index files from indices directory to ref_data directory. I re-ran it with the same error.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.9 years ago by Cricket ▴ 10

0

Entering edit mode

8.9 years ago

biocyberman ▴ 870

Make sure you have chromosome name the same in the GTF and the sequence ID in the reference fasta file. They look different to me.

ADD COMMENT • link 8.9 years ago by biocyberman ▴ 870

0

Entering edit mode

Thank you biocyberman for your response. I believe I addressed your suggestion already with the awk command -- that was the entire purpose of the code. After running the command, all entries in the first column will be BA000007.2. The name of my reference fasta file is BA000007.2.fa.

If this is not the correct approach please let me know.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.9 years ago by Cricket ▴ 10

0

Entering edit mode

The awk command may not give what you expect. Check the output. I tried the command on the gtf file and it replaces the comment lines as well. Try this one:

awk 'BEGIN{OFS="\t"}($0 !~/^#/){$1 = "BA000007"} {print}'

(I removed the .2 vesion number, just in case other tools do not like the dot).

After that rerun the tophat command that tries to create the transcriptome index. Or, run tophat without specify --transcriptome-index option.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.9 years ago by biocyberman ▴ 870

0

Entering edit mode

Biocyberman -- thank you for your attention and assistance.

I am aware that the first lines of the file will not allow me to do what I intended with the awk command. I dealt with it already and mentioned it within the text of my question (#please note the commented build information (e.g., #!genome-build ASM80120v1) at the beginning of ensemble gtf file would create undesirable output from the awk command has been addressed).

As for your other suggestions...

I removed the .2 in vim (:%s/^BA000007.2/BA000007/g) from the 1st column in the gtf file. I changed the name of the fasta file to reflect this and re-ran it and received an identical error.

I removed the --transcriptome-index from the code, ran it again, and received an identical error.

Do you have any other suggestions?

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.9 years ago by Cricket ▴ 10

score 1 · Accepted Answer · 2015-12-31

1

Entering edit mode

8.9 years ago

Cricket ▴ 10

Found the source of the problem...it was my referential fasta file. The header needed to be ">BA000007". Changing that solved everything.

ADD COMMENT • link 8.9 years ago by Cricket ▴ 10