Greetings: I am using Tophat2 (command line) to analyze RNA-seq data and I am encountering some errors.
Here is the call:
tophat2 \
-o tophat2_results/ \
-G ref_data/BA000007.2.gtf \
--transcriptome-index=transcriptome_data/RNA_LBG01b_241_filteredQ indices/BA000007.2 \
data_files/RNA_LBG01b_241_filteredQ.fastq
Here is the error:
[2015-12-29 12:58:33] Checking for Bowtie
Bowtie version: 2.2.4.0
[2015-12-29 12:58:33] Checking for Bowtie index files (genome)..
[2015-12-29 12:58:33] Checking for reference FASTA file
[2015-12-29 12:58:33] Generating SAM header for indices/BA000007.2
[2015-12-29 12:58:33] Reading known junctions from GTF file
Warning: TopHat did not find any junctions in GTF file
[2015-12-29 12:58:33] Preparing reads
left reads: min. length=12, max. length=342, 202732 kept reads (1315 discarded)
Warning: short reads (<20bp) will make TopHat quite slow and take large amount of memory because they are likely to be mapped in too many places
[2015-12-29 12:58:39] Building transcriptome data files transcriptome_data/RNA_LBG01b_241_filteredQ
[2015-12-29 12:58:40] Building Bowtie index from RNA_LBG01b_241_filteredQ.fa
[FAILED]
Error: Couldn't build bowtie index with err = 1
Version Information:
TopHat v2.1.0
Bowtie2 version 2.2.4
Python 2.7.10 :: Anaconda 2.4.0 (64-bit)
System Information:
CentOS Release 6.7
How I got here and what I have tried:
I am using E. coli (Accession: BA000007.2) for my reference genome which can be found here: http://www.ncbi.nlm.nih.gov/nuccore/BA000007.2
I obtained my GTF file from Ensembl (ftp://ftp.ensemblgenomes.org/pub/release-29/bacteria//gtf/bacteria_9_collection/escherichia_coli_o157_h7_str_sakai/)
I made my indices using bowtie2-build (before tophat2 call)
bowtie2-build -f ref_data/BA000007.2.fasta indices/BA000007.2
I am aware that the error I am receiving is associated with a mismatch between the names in the first column of the *.gtf file and the name of the reference fasta file. If I understand this correctly, every entry in the 1st column should be BA000007.2, whereas most of the names in the 1st column were "Chromosome". To fix this, I did the following:
awk '{FS=OFS="\t"}{print "BA000007.2", $2, $3, $4, $5, $6, $7, $8, $9}' pathToGTF/BA000007.2_ensemble.gtf > pathToGTF/BA000007.2.gtf
Please note that the commented build information (e.g., #!genome-build ASM80120v1) at the beginning of the Ensembl gtf file, which would otherwise create undesirable output from the awk command, has been addressed.
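For reference, one way to handle the header lines and the renaming in a single pass is sketched below. This is only a sketch under stated assumptions: the function name `rename_gtf_seqname` is made up for illustration, comment lines are assumed to start with `#`, and data lines are assumed to be tab-separated with nine columns. Setting `FS` and `OFS` in a `BEGIN` block (rather than in the main block) means the very first line is also split on tabs.

```shell
# Hypothetical helper (not part of TopHat): rewrite column 1 of a GTF read
# from stdin to the given sequence name, leaving '#' comment lines untouched.
rename_gtf_seqname() {
    awk -v name="$1" '
        BEGIN   { FS = OFS = "\t" }   # set separators before the first line is read
        /^#/    { print; next }       # pass header/comment lines through unchanged
        NF >= 9 { $1 = name }         # rename column 1 on nine-column data lines
                { print }             # print every remaining line
    '
}

# Example usage with the paths from the post:
# rename_gtf_seqname BA000007.2 < pathToGTF/BA000007.2_ensemble.gtf > pathToGTF/BA000007.2.gtf
```

Because column 9 is never split on spaces here, the attribute field (gene_id "..."; etc.) is passed through exactly as it appears in the input.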
I also changed the extension of the fasta file from *.fasta to *.fa
And to make sure that bowtie can access the fasta reference file, the reference file is not only in the directory with the gtf file, but also in ALL immediate subdirectories!
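A quick way to check whether the renaming described above actually lines up with the reference is to compare the sequence names directly. As far as I understand the TopHat manual, the first GTF column has to match the sequence names stored in the Bowtie index (which `bowtie2-inspect --names indices/BA000007.2` should list), and those come from the FASTA header line up to the first whitespace, not from the file name. The sketch below works on the FASTA file itself so it needs nothing but awk; the function name `gtf_names_missing_from_fasta` is hypothetical.

```shell
# Hypothetical helper: print each sequence name used in column 1 of a GTF file
# that does not appear as a header name in the FASTA file.
# Usage: gtf_names_missing_from_fasta reference.fa annotation.gtf
gtf_names_missing_from_fasta() {
    awk '
        # First file (FASTA): remember header names, i.e. the text after ">"
        # up to the first whitespace; skip sequence lines.
        FILENAME == ARGV[1] {
            if (/^>/) seen[substr($1, 2)] = 1
            next
        }
        # Second file (GTF): ignore comment and empty lines.
        /^#/ || NF == 0 { next }
        # Report each unmatched name once.
        !($1 in seen) && !($1 in reported) {
            reported[$1] = 1
            print $1
        }
    ' "$1" "$2"
}

# Example usage with the paths from the post:
# gtf_names_missing_from_fasta ref_data/BA000007.2.fa ref_data/BA000007.2.gtf
```

Empty output would mean every name in the GTF also occurs as a FASTA header name; anything printed is a name the index would not know about.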
Questions:
Did I properly put the kibosh on any problems arising from differences in naming between the 1st column of the gtf file and the name of the fasta file (BA000007.2, BA000007.2.fa)?
When I peruse the output in the logs directory, there are several errors (in g2f.err, and similar errors in gtf_juncs.log) with lines beginning with:
Warning: invalid start coordinate at line: BA000007.2 ena gene -194 2502 . + . gene_id "BAA31757"; gene_version "1"; gene_name "tagA"; gene_source "ena"; gene_biotype "protein_coding";
There are indeed negative numbers in the gtf files, but not in the genbank file (quick search in vim). Could this be the source of the error? I commented out the specific lines and then deleted them from the file -- both approaches still result in the error.
Is there anything readily seen that could be causing the "Couldn't build bowtie index with err = 1" error?
I have been stuck on this for a couple of days so any help is greatly appreciated.
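In case it helps anyone reproduce the coordinate warnings mentioned above, here is a small sketch for listing the offending records. GTF coordinates are 1-based, so any start below 1 (or an end smaller than its start) is invalid. The function name `list_invalid_coords` is hypothetical, and the nine-column, tab-separated layout is an assumption about the file.

```shell
# Hypothetical helper: list GTF records whose start/end coordinates are invalid.
# Prints: line number, feature type, start, end (tab-separated).
# Usage: list_invalid_coords annotation.gtf
list_invalid_coords() {
    awk '
        BEGIN { FS = OFS = "\t" }
        /^#/ || NF < 9 { next }                  # skip comments and short lines
        ($4 + 0) < 1 || ($5 + 0) < ($4 + 0) {    # start below 1, or end before start
            print FNR, $3, $4, $5
        }
    ' "$1"
}

# Example usage with the path from the post:
# list_invalid_coords ref_data/BA000007.2.gtf
```

This only reports the records; it does not modify the file, so it can be run before and after any manual clean-up to confirm that no such lines remain.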
Hi;
have you tried
instead of your previous command line
Since the files would be the same, I copied the index files from indices directory to ref_data directory. I re-ran it with the same error.