tophat2 GTF invalid strand error
8.7 years ago
natsterbug ▴ 10

I am running TopHat 2.1.0 (Bowtie 2.2.6.0) on single-end RNA-seq data and encountering the following error:

[2016-02-29 09:38:34] Beginning TopHat run (v2.1.0)
-----------------------------------------------
[2016-02-29 09:38:34] Checking for Bowtie
Bowtie version: 2.2.6.0
[2016-02-29 09:38:34] Checking for Bowtie index files (transcriptome)..
[2016-02-29 09:38:34] Checking for Bowtie index files (genome)..
[2016-02-29 09:38:34] Checking for reference FASTA file
[2016-02-29 09:38:34] Generating SAM header for PGSC_DM_v4.03_index
[2016-02-29 09:38:40] Reading known junctions from GTF file
[2016-02-29 09:38:44] Preparing reads
left reads: min. length=30, max. length=40, 47957373 kept reads (864 discarded)
[2016-02-29 09:48:09] Using pre-built transcriptome data..
[2016-02-29 09:48:11] Mapping left_kept_reads to transcriptome known with Bowtie2
[FAILED]
Error running:
/opt/software/TopHat2/2.1.0--GCC-4.4.5/bin/bam2fastx --all tophat_Kalkaska/tmp/left_kept_reads.bam|/opt/software/bowtie2/2.2.6--GCC-4.4.5/bin/bowtie2 -k 60 -D 15 -R 2 -N 0 -L 20 -i S,1,1.25 --gbar 4 --mp 6,2 --np 1 --rdg 5,3 --rfg 5,3 --score-min C,-62,0 -p 1 --sam-no-hd -x transcriptome_data/known -|/opt/software/TopHat2/2.1.0--GCC-4.4.5/bin/fix_map_ordering --bowtie2-min-score 55 --read-mismatches 3 --read-gap-length 10 --read-edit-dist 10 --read-realign-edit-dist 11 --sam-header tophat_Kalkaska/tmp/known.bwt.samheader.sam - - tophat_Kalkaska/tmp/left_kept_reads.m2g_um.bam | /opt/software/TopHat2/2.1.0--GCC-4.4.5/bin/map2gtf --sam-header tophat_Kalkaska/tmp/PGSC_DM_v4.03_index_genome.bwt.samheader.sam transcriptome_data/known.fa.tlst - tophat_Kalkaska/tmp/left_kept_reads.m2g.bam > tophat_Kalkaska/logs/m2g_left_kept_reads.out

When I run the failed command from the error message directly, I get:

Error at parsing .tlst line (invalid strand):
53551 PGSC0003DMT400030180 ST4.03ch12. 56298573-56298656
(ERR): bowtie2-align died with signal 13 (PIPE)

In the annotation file (PGSC_DM_V403_genes.gff from SpudDB), a fair number of entries have the strand recorded as "." rather than "+" or "-".
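To see exactly how many records carry each strand value, the strand column can be tallied. This is only a rough sketch: it assumes the file is a standard tab-separated, nine-column GFF/GTF with the strand in column 7, and the helper name count_strands and the toy records are made up for illustration.

```python
from collections import Counter


def count_strands(lines):
    """Tally the values in the strand column (column 7) of GFF/GTF lines."""
    counts = Counter()
    for line in lines:
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed nine-column record
        counts[fields[6]] += 1
    return counts


# Toy records (invented), just to show the call.
toy = [
    "##gff-version 3\n",
    "chrA\tsrc\texon\t1\t100\t.\t+\t.\tParent=t1\n",
    "chrA\tsrc\texon\t200\t300\t.\t.\t.\tParent=t2\n",
    "chrA\tsrc\texon\t400\t500\t.\t+\t.\tParent=t1\n",
]
for strand, n in count_strands(toy).most_common():
    print(strand, n, sep="\t")
```

On a real file, pass the open file handle instead of the toy list, e.g. count_strands(open("PGSC_DM_V403_genes.gff")).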

Online forums report the same issue but no solutions, save for deleting these entries, which are very numerous. Are there any other solutions I could try?

RNA-Seq software error • 4.5k views

Are those entries redundant, i.e., are there other lines with those IDs that have valid strand information?
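One way to check this is to group records by their ID and look at which strand values each ID has. A minimal sketch, assuming GFF3-style attributes in column 9 (key=value pairs separated by semicolons) and that the transcript ID sits in a Parent= attribute; both are assumptions about this file, and the function names are hypothetical. Multi-valued Parent fields (comma-separated) are treated as a single key here.

```python
import re
from collections import defaultdict

VALID = {"+", "-"}


def strands_by_id(lines, key="Parent"):
    """Map each value of the given attribute key to the set of strands seen for it."""
    pattern = re.compile(r"(?:^|;)\s*" + re.escape(key) + r"=([^;]+)")
    seen = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        match = pattern.search(fields[8])
        if match:
            seen[match.group(1)].add(fields[6])
    return seen


def classify(seen):
    """Split IDs into those with mixed strand info and those with no valid strand at all."""
    # "redundant": the ID has at least one invalid and at least one valid strand.
    redundant = sorted(i for i, s in seen.items() if (s - VALID) and (s & VALID))
    # "unique": the ID never appears with "+" or "-".
    unique = sorted(i for i, s in seen.items() if not (s & VALID))
    return redundant, unique
```

If the first list is non-empty, the strand for those IDs could in principle be recovered from their other lines; IDs in the second list have nothing to recover from.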

The entries lacking valid strand information are unique.

I had a look at the GFF file. It appears that most of the annotation entries come from Cufflinks and a few other gene prediction algorithms (GLEAN, BESTORF, etc.). Even though the coordinates differ, most of the entries appear to be covered by Cufflinks records at nearby coordinates.

Since you don't have a better alternative (I assume), you may want to remove entries that don't have a + or - in the strand field (those are the only strand values TopHat accepts here, as the "invalid strand" error shows) and proceed with the analysis.
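The filtering step could look something like the sketch below. It assumes a tab-separated, nine-column GFF/GTF with the strand in column 7; the function names and the file names in the usage comment are hypothetical. Header and comment lines are passed through, and records with fewer than nine columns are dropped along with the strandless ones.

```python
def keep_stranded(lines):
    """Yield comment/blank lines and records whose strand column is '+' or '-'."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            yield line  # keep headers, comments and blank lines as they are
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9 and fields[6] in ("+", "-"):
            yield line


def filter_gff(src_path, dst_path):
    """Write a copy of src_path that contains only stranded records."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(keep_stranded(src))


# Usage (hypothetical output name):
# filter_gff("PGSC_DM_V403_genes.gff", "PGSC_DM_V403_genes.stranded.gff")
```

Because this works line by line, it is worth checking afterwards that no feature was kept whose parent record was dropped, or the other way round.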

As I do not have a better alternative, I filtered the GFF and ran tophat2 using the filtered file and did not encounter any errors.

8.7 years ago

Not really. It is an invalid input file that needs to be filtered.

The entries with "." make up 7% of the total and I am concerned about removing all these. Is this a valid concern?
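One way to judge how much would be lost is to break the strandless records down by source (column 2) and feature type (column 3), alongside the overall fraction. A rough sketch, assuming a tab-separated, nine-column GFF/GTF; the function name is hypothetical.

```python
from collections import Counter


def strandless_breakdown(lines):
    """Return (fraction of records without +/- strand, Counter keyed by (source, type))."""
    total = 0
    by_source_type = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        total += 1
        if fields[6] not in ("+", "-"):
            # Column 2 is the source (e.g. the prediction program), column 3 the feature type.
            by_source_type[(fields[1], fields[2])] += 1
    strandless = sum(by_source_type.values())
    fraction = strandless / total if total else 0.0
    return fraction, by_source_type
```

If most of the strandless records come from one prediction source, or are single-exon features that carry no junctions anyway, dropping them probably costs less than the raw percentage suggests.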

If you don't have strand information then the aligner can't splice over these.

TopHat will find new splice sites (that is, those not listed in the file), so you don't have to be overly concerned about missing out.
