Although I'm very inexperienced with bioinformatics, what I'm trying to do is very straightforward. I want to align my MiSeq mRNA-seq reads to the mouse transcriptome.
Thus far, I've downloaded the Ensembl GRCm38dna.fa genome file and indexed it with bowtie2-build.
I've also downloaded the Ensembl GRCm38.85.GTF file for transcriptome annotation.
To run TopHat, I'm using the following command (default parameters):
tophat2 -G MusGRCm3885.gtf MusGRCm38dna 560RF.fastq
However, I'm getting the error shown at the end of the log below.
I'm not quite sure what's going on. The computer I'm using has ~4 GB of RAM. Should I change the minimum length to <50, considering my mRNA snippets are ~30 bases?
[2016-09-16 13:05:51] Checking for Bowtie
Bowtie version: 2.2.9.0
[2016-09-16 13:05:52] Checking for Bowtie index files (genome)..
[2016-09-16 13:05:52] Checking for reference FASTA file
[2016-09-16 13:05:52] Generating SAM header for MusGRCm38dna
[2016-09-16 13:07:06] Reading known junctions from GTF file
[2016-09-16 13:07:33] Preparing reads
left reads: min. length=50, max. length=50, 20782969 kept reads (99 discarded)
[2016-09-16 13:12:10] Building transcriptome data files ./tophat_out/tmp/MusGRCm3885
[2016-09-16 13:13:41] Building Bowtie index from MusGRCm3885.fa
[2016-09-16 13:30:36] Mapping left_kept_reads to transcriptome MusGRCm3885 with Bowtie2
[2016-09-16 13:47:27] Resuming TopHat pipeline with unmapped reads
[2016-09-16 13:47:27] Mapping left_kept_reads.m2g_um to genome MusGRCm38dna with Bowtie2
[2016-09-16 14:33:35] Mapping left_kept_reads.m2g_um_seg1 to genome MusGRCm38dna with Bowtie2 (1/2)
[2016-09-16 15:30:31] Mapping left_kept_reads.m2g_um_seg2 to genome MusGRCm38dna with Bowtie2 (2/2)
[2016-09-16 15:55:52] Searching for junctions via segment mapping
Coverage-search algorithm is turned on, making this step very slow
Please try running TopHat again with the option (--no-coverage-search) if this step takes too much time or memory.
[FAILED]
Error: segment-based junction search failed with err =-9
found 0 potential small insertions
You'll have to look through the tophat log to find the last command it's running. If you then run that yourself you'll get the actual underlying error message, which will hopefully be more informative.
Having said that, STAR is faster and tends to produce better results.
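As a rough sketch of how to find the last command TopHat ran: the helper below prints the tail of the run log. It assumes the default output directory (tophat_out) and that the executed commands are recorded in logs/run.log inside it; the function name is made up for illustration.

```shell
# Hypothetical helper: show the last few lines of TopHat's run log.
# Assumption: default output directory, with the log at tophat_out/logs/run.log.
show_last_tophat_commands() {
    log="${1:-tophat_out/logs/run.log}"   # path to the run log
    count="${2:-5}"                       # how many trailing lines to print
    tail -n "$count" "$log"
}
```

Calling show_last_tophat_commands with no arguments prints the last five lines; the final command listed there can then be copied and run by hand to see its own error message.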
Running the coverage search can be very intensive in terms of memory and CPU usage. You might have better luck running it on a cluster using the parallel option.
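A minimal sketch of a rerun with the coverage search switched off, as the log itself suggests. The file and index names are copied from the original command. The wrapper name is hypothetical, and reading "the parallel option" as TopHat's -p (number of threads) is an assumption. On a machine with about 4 GB of RAM, more threads may not help if memory is the real limit.

```shell
# Hypothetical wrapper around the original command.
# --no-coverage-search is the option named in the TopHat log above;
# -p sets the number of threads (assumed to be the "parallel option" meant here).
rerun_tophat() {
    threads="${1:-1}"   # default to a single thread on a small machine
    tophat2 --no-coverage-search -p "$threads" \
        -G MusGRCm3885.gtf MusGRCm38dna 560RF.fastq
}
```

For example, rerun_tophat 4 would run the same alignment with four threads on a machine that has the cores and memory to spare.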
I agree with Devon that you might want to shift to STAR, which will eventually take TopHat2's place. But... what do you mean by "considering my mRNA snippets are ~30 bases"?
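The log reports min. length=50 and max. length=50 for the kept reads, so one way to check what is actually in the FASTQ file is to tabulate the read lengths. This is only a sketch: it assumes a plain, uncompressed FASTQ with exactly four lines per record, and the function name is made up.

```shell
# Hypothetical helper: print "length count" pairs for the reads in a FASTQ file.
# Assumption: uncompressed FASTQ with four lines per record (sequence on line 2 of each).
read_length_hist() {
    awk 'NR % 4 == 2 { n[length($0)]++ }
         END { for (l in n) print l, n[l] }' "$1" | sort -n
}
```

Running read_length_hist 560RF.fastq should show whether the reads are all 50 bases long (for instance, a ~30-base insert followed by adapter sequence), in which case adapter trimming before alignment may be the relevant step rather than changing a length setting.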