I'm trying to align a set of reads in fasta format to a reference genome (indexed with bowtie2-build), but tophat is failing with the following error:
tophat thc_genomic 5369.fa
[2017-04-20 10:14:53] Beginning TopHat run (v2.1.1)
-----------------------------------------------
[2017-04-20 10:14:53] Checking for Bowtie
Bowtie version: 2.3.1.0
[2017-04-20 10:14:53] Checking for Bowtie index files (genome)..
[2017-04-20 10:14:53] Checking for reference FASTA file
[2017-04-20 10:14:53] Generating SAM header for thc_genomic
[2017-04-20 10:14:53] Preparing reads
left reads: min. length=95, max. length=101, 4488 kept reads (0 discarded)
[2017-04-20 10:14:54] Mapping left_kept_reads to genome thc_genomic with Bowtie2
[FAILED]
Error running bowtie:
Error: reads file does not look like a FASTQ file
Error: Encountered exception: 'Unidentified exception'
Command: /usr/local/bin/../Cellar/bowtie2/2.3.0/bin/bowtie2-align-s --wrapper basic-0 -k 20 -D 15 -R 2 -N 0 -L 20 -i S,1,1.25 --gbar 4 --mp 6,2 --np 1 --rdg 5,3 --rfg 5,3 --score-min C,-14,0 -p 1 --sam-no-hd -x thc_genomic -
(ERR): bowtie2-align exited with value 1
(sorry the cut and paste is making that a little hard to read).
I agree that the input file doesn't look like fastq because actually, it's not. It's a fasta file, and the docs say that tophat now auto-detects fasta and behaves accordingly. I don't see a command line flag to force it to treat the input file as fasta -- what am I missing?
My input read file looks like this:
>gnl|SRA|SRR352208.104462060.1 HWI-ST600:7:68:16692:76873
TAAGTTTAATAAATCTAAGTCAAAGTTCAATATCGGTCCCTTCCGATGCATACTCCATGCATCCAACCTG
AGCTTTACTTTAACCAATGCTCTGGAAAGAA
>gnl|SRA|SRR352208.103232838.2 HWI-ST600:7:67:16843:150317
AACTAGGCGATATTGACTTGAATAGATATTACGGTAAGTTTAATAAATCTAAGTCAAAGTTCAATATCGG
TCCCTTCCGATGCATACTCCATGCATCCAAC
Surely I'm missing a command line flag, or does tophat need the fasta id line to be formatted a specific way?
Thanks
I have formatted your code correctly. In future use the icon shown below (after highlighting the text you want to format as code) when editing (Screenshot courtsey of @Wouter).
I know by personal experience that Tophat2 is picky about read names. Try with a small subset of reads: rename them like ">read1" to ">readN" and see if it works! Perhaps it doesn't like the illumina read name format in a FASTA but only in a FASTQ.
Yeah, I tried that, with the id lines changed just the way you suggested, but I get the same result.
Are those actual white lines between your fasta records?
You should also know that Tophat is no longer the "advisable" tool for RNA-seq analysis. The software is deprecated/ in low maintenance and should be replaced by HISAT2, StringTie and ballgown. See this paper: Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. (If you can't get access to that publication, let me know and I'll -cough- help you.) There are also other alternatives, including alignment with STAR and bbmap,...
the blank line between the two fasta records is an artifact of cutting and pasting from a terminal window. It's not present in the actual fasta file.
But your second point is very interesting, I didn't know the field moved on, I'm going to read that paper right now (and I have -cough- ways -cough- of getting around most access restrictions, but I appreciate the offer).
Thanks for the advice.
Any splice-aware aligner (BBMap, STAR are good options) along with featureCounts from Subread package would be a great option to look into.