I am running TopHat 2.1.0 (with Bowtie2 2.2.6.0) on single-end RNA-seq data and encountering the following error:
[2016-02-29 09:38:34] Beginning TopHat run (v2.1.0)
-----------------------------------------------
[2016-02-29 09:38:34] Checking for Bowtie
Bowtie version: 2.2.6.0
[2016-02-29 09:38:34] Checking for Bowtie index files (transcriptome)..
[2016-02-29 09:38:34] Checking for Bowtie index files (genome)..
[2016-02-29 09:38:34] Checking for reference FASTA file
[2016-02-29 09:38:34] Generating SAM header for PGSC_DM_v4.03_index
[2016-02-29 09:38:40] Reading known junctions from GTF file
[2016-02-29 09:38:44] Preparing reads
left reads: min. length=30, max. length=40, 47957373 kept reads (864 discarded)
[2016-02-29 09:48:09] Using pre-built transcriptome data..
[2016-02-29 09:48:11] Mapping left_kept_reads to transcriptome known with Bowtie2
[FAILED]
Error running:
/opt/software/TopHat2/2.1.0--GCC-4.4.5/bin/bam2fastx --all tophat_Kalkaska/tmp/left_kept_reads.bam|/opt/software/bowtie2/2.2.6--GCC-4.4.5/bin/bowtie2 -k 60 -D 15 -R 2 -N 0 -L 20 -i S,1,1.25 --gbar 4 --mp 6,2 --np 1 --rdg 5,3 --rfg 5,3 --score-min C,-62,0 -p 1 --sam-no-hd -x transcriptome_data/known -|/opt/software/TopHat2/2.1.0--GCC-4.4.5/bin/fix_map_ordering --bowtie2-min-score 55 --read-mismatches 3 --read-gap-length 10 --read-edit-dist 10 --read-realign-edit-dist 11 --sam-header tophat_Kalkaska/tmp/known.bwt.samheader.sam - - tophat_Kalkaska/tmp/left_kept_reads.m2g_um.bam | /opt/software/TopHat2/2.1.0--GCC-4.4.5/bin/map2gtf --sam-header tophat_Kalkaska/tmp/PGSC_DM_v4.03_index_genome.bwt.samheader.sam transcriptome_data/known.fa.tlst - tophat_Kalkaska/tmp/left_kept_reads.m2g.bam > tophat_Kalkaska/logs/m2g_left_kept_reads.out
When I run the failing command by hand, I get:
Error at parsing .tlst line (invalid strand):
53551 PGSC0003DMT400030180 ST4.03ch12. 56298573-56298656
(ERR): bowtie2-align died with signal 13 (PIPE)
Looking at the annotation file (PGSC_DM_V403_genes.gff from SpudDB), a fair number of entries have the strand recorded as "." rather than + or -.
Online forums report the same issue but offer no solution other than deleting these entries, which are very numerous. Are there any other solutions I could try?
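To put a number on "very numerous", the strand column can be tallied before deciding what to do. This is only a minimal sketch: it assumes a tab-delimited GFF/GTF with the strand in column 7, and the function name and the sample lines are made up for illustration, not taken from the PGSC file.

```python
from collections import Counter

def count_strands(lines):
    """Tally the values found in column 7 (strand) of GFF/GTF feature lines."""
    counts = Counter()
    for line in lines:
        # Skip blank lines and comment/directive lines such as "##gff-version 3".
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Skip anything that does not have the nine standard columns.
        if len(fields) < 9:
            continue
        counts[fields[6]] += 1
    return counts

# Made-up sample lines, only to show the input shape.
sample = [
    "##gff-version 3\n",
    "chrA\tsrc\tgene\t100\t200\t.\t+\t.\tID=g1\n",
    "chrA\tsrc\tgene\t300\t400\t.\t-\t.\tID=g2\n",
    "chrA\tsrc\tgene\t500\t600\t.\t.\t.\tID=g3\n",
]
strand_counts = count_strands(sample)
for strand, n in sorted(strand_counts.items()):
    print(strand, n, sep="\t")
```

For the real file, passing an open file handle instead of the sample list should work the same way, since the function only iterates over lines.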
Are those entries redundant, i.e., are there other lines with those IDs that have valid strand information?
The entries lacking valid strand information are unique.
I had a look at the GFF file. Most of the annotation entries appear to come from Cufflinks and a few other gene prediction algorithms (GLEAN, BESTORF, etc.). Although the coordinates differ, most of the entries appear to be covered by Cufflinks records at nearby positions.
Since you don't have a better alternative (I assume), you may want to remove the entries whose strand is not + or - (TopHat rejects anything else as an invalid strand, as the error shows) and proceed with the analysis.
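The filtering suggested above could look roughly like this. It is a sketch, not a tested recipe for this annotation: it assumes a tab-delimited file with the strand in column 7, the helper names are made up, and the output file name in the usage comment is hypothetical.

```python
def is_stranded(line):
    """True for comment/blank lines and for features whose strand is + or -."""
    if line.startswith("#") or not line.strip():
        return True
    fields = line.rstrip("\n").split("\t")
    # Lines with fewer than 7 columns have no strand field and are dropped too.
    return len(fields) >= 7 and fields[6] in ("+", "-")

def filter_gff(in_path, out_path):
    """Copy in_path to out_path without unstranded features.

    Returns (kept, dropped) line counts; kept includes comment lines.
    """
    kept = dropped = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if is_stranded(line):
                dst.write(line)
                kept += 1
            else:
                dropped += 1
    return kept, dropped

# Example call (the output name is hypothetical):
# filter_gff("PGSC_DM_V403_genes.gff", "PGSC_DM_V403_genes.stranded.gff")
```

Two caveats worth checking: the filter works line by line, so a parent feature with "." could be dropped while its stranded children are kept, and any transcriptome index built from the unfiltered file presumably has to be rebuilt from the filtered one.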
As I do not have a better alternative, I filtered the GFF, reran TopHat2 with the filtered file, and did not encounter any errors.