GTF format that's acceptable for Tophat/Cufflinks
1
1
Entering edit mode
8.6 years ago
ruthjfahey ▴ 20

I'm trying to figure out why Tophat is failing to align a custom gtf file. I'm new to scripting, any suggestions appreciated:

My file is a subset of annotations from the gencode.v19.long_noncoding_RNAs.gtf from Gencode. A typical line looks like this:

chr1    HAVANA  gene    29554   31109   .   +   .   gene_id "ENSG00000243485.2"; transcript_id "ENSG00000243485.2"; gen
e_type "lincRNA"; gene_status "NOVEL"; gene_name "MIR1302-11"; transcript_type "lincRNA"; transcript_status "NOVEL"; transcript_nam
e "MIR1302-11"; level 2; havana_gene "OTTHUMG00000000959.2"

My amended file looks like this:

chr1    HAVANA  gene    28832492    28837404    .   +   .   gene_id "ENSG00000242125.2"; transcript_id "ENSG000
00242125.2"; gene_type "sense_intronic"; gene_status "KNOWN"; gene_name "SNHG3"; transcript_type "sense_intronic"; transcript_statu
s "KNOWN"; transcript_name "SNHG3"; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000003657.1"

I parsed the gencode.gtf file with a python script, to only take lines with specific gene names, such as SNHG3 above. I also converted any instances of 'transcript' in column 3 to CDS in the custom file, would this be causing the problem? The original file still contains all instances of 'transcript' in column 3.

However, when I run tophat with both the full and subset gtf file, there is an error for the subset file as shown in the tophat output log:

[2016-05-04 11:10:44] Beginning TopHat run (v2.1.1)
-----------------------------------------------
[2016-05-04 11:10:44] Checking for Bowtie
          Bowtie version:    2.2.8.0
[2016-05-04 11:10:44] Checking for Bowtie index files (genome)..
[2016-05-04 11:10:44] Checking for reference FASTA file
[2016-05-04 11:10:44] Generating SAM header for GRCh37
[2016-05-04 11:10:46] Reading known junctions from GTF file
[2016-05-04 11:10:46] Preparing reads
     left reads: min. length=101, max. length=101, 38986365 kept reads (133725 discarded)
[2016-05-04 11:24:46] Building transcriptome data files tophat_amendedLncRNAs_HIV/tmp/gtfConverted
[2016-05-04 11:24:51] Building Bowtie index from gtfConverted.fa
[2016-05-04 11:27:31] Mapping left_kept_reads to transcriptome gtfConverted with Bowtie2 
[2016-05-04 16:03:50] Resuming TopHat pipeline with unmapped reads
[2016-05-04 16:03:50] Mapping left_kept_reads.m2g_um to genome GRCh37 with Bowtie2 
[2016-05-04 22:25:27] Mapping left_kept_reads.m2g_um_seg1 to genome GRCh37 with Bowtie2 (1/4)
[2016-05-04 22:47:22] Mapping left_kept_reads.m2g_um_seg2 to genome GRCh37 with Bowtie2 (2/4)
[2016-05-04 23:07:50] Mapping left_kept_reads.m2g_um_seg3 to genome GRCh37 with Bowtie2 (3/4)
[2016-05-04 23:32:11] Mapping left_kept_reads.m2g_um_seg4 to genome GRCh37 with Bowtie2 (4/4)
[2016-05-05 00:10:42] Retrieving sequences for splices
[2016-05-05 00:14:12] Indexing splices
Building a SMALL index
[2016-05-05 00:14:23] Mapping left_kept_reads.m2g_um_seg1 to genome segment_juncs with Bowtie2 (1/4)
[2016-05-05 00:15:28] Mapping left_kept_reads.m2g_um_seg2 to genome segment_juncs with Bowtie2 (2/4)
[2016-05-05 00:16:40] Mapping left_kept_reads.m2g_um_seg3 to genome segment_juncs with Bowtie2 (3/4)
[2016-05-05 00:17:40] Mapping left_kept_reads.m2g_um_seg4 to genome segment_juncs with Bowtie2 (4/4)
[2016-05-05 00:18:42] Joining segment hits
    [FAILED]
Error running 'long_spanning_reads':Loading insertions...done

Does anyone have any idea what may be wrong with the subset gtf file? Is it the conversion of 'transcript' to CDS?

RNA-Seq alignment • 3.8k views
ADD COMMENT
1
Entering edit mode

Try Tophat previous version 2.1.0. Since 2.1.1 is a maintenance version, there might be some problem with that.

ADD REPLY
1
Entering edit mode

Forget Tophat, use STAR

ADD REPLY
3
Entering edit mode

Not very helpful. At least motivate your statement.

ADD REPLY
1
Entering edit mode

A fail on Tophat 2.1.0. I'll look into Star next...

ADD REPLY
0
Entering edit mode

If STAR crash, post a new question and I will try to help you.

ADD REPLY
0
Entering edit mode

Did you only select lines with gene annotations and leave the exon's behind? Can you post the tophat command you are running?

ADD REPLY
0
Entering edit mode

Thanks for the replies. Here's the tophat command I ran for the failed alignment: tophat -p 8 --no-novel-juncs -G gtfConverted.gtf -o tophat_amendedLncRNAs_HIV GRCh37 HIVpos.fastq.gz.

I selected all lines, including genes and exons, assuming that the 'exon' lines also contained the same 'gene name'. The python script was written to parse out all lines that contained a specific 'gene name', irrespective of column three. Then I converted all instances of 'transcript' in column 3 to 'CDS'. This was what I was advised, as Tophat doesn't recognize 'transcript' ?

I'll try running it with Tophat version 2.1.0 in the meantime...

ADD REPLY
0
Entering edit mode

I think you should keep the transcript intact, and rather change exon by CDS.

ADD REPLY
0
Entering edit mode
8.5 years ago

Any luck? The GTF format looks fine to me. Did you replace the word transcript there with CDS too (i.e. turn it into CDS_name in the info column) or just transcript in column 3? Glancing at the code tophat is looking for something matching "transcript_name" at least at some point in it.

ADD COMMENT
0
Entering edit mode

I only changed the transcript in column 3 to 'CDS'. I can modify my code to switch 'transcript_name' to 'CDS_name in the info column. I take it that tophat is getting confused when the feature type is CDS and there is no corresponding CDS_name in the info column, just 'transcript_name' ? I'll let you know if it works.

ADD REPLY
0
Entering edit mode

Still no luck with tophat, same error. When you say it's looking for something matching 'transcipt_name' in the code, whereabouts are you seeing that? I amended the gtf so that a typical amended line reads:

chr1    HAVANA  CDS 28832492    28837404    .   +   .   gene_id "ENSG00000242125.2"; transcript_id "ENST00000413987.1"; gene_type "sense_intronic"; gene_status "KNOWN"; gene_name "SNHG3"; transcript_type "retained_intron"; transcript_status "KNOWN"; CDS_name "SNHG3-001"; level 2; havana_gene

Here's the tophat log:

[2016-05-17 14:52:45] Beginning TopHat run (v2.1.1)
-----------------------------------------------
[2016-05-17 14:52:45] Checking for Bowtie
          Bowtie version:    2.2.8.0
[2016-05-17 14:52:45] Checking for Bowtie index files (genome)..
[2016-05-17 14:52:45] Checking for reference FASTA file
[2016-05-17 14:52:45] Generating SAM header for genome
[2016-05-17 14:52:46] Reading known junctions from GTF file
[2016-05-17 14:52:46] Preparing reads
     left reads: min. length=101, max. length=101, 38986365 kept reads (133725 di
scarded)
[2016-05-17 15:04:50] Building transcriptome data files tophat_HIV_out/tmp/genes
[2016-05-17 15:05:00] Building Bowtie index from genes.fa
[2016-05-17 15:07:35] Mapping left_kept_reads to transcriptome genes with Bowtie2 
[2016-05-17 17:29:13] Resuming TopHat pipeline with unmapped reads
[2016-05-17 17:29:13] Mapping left_kept_reads.m2g_um to genome genome with Bowtie2 
[2016-05-17 20:43:30] Mapping left_kept_reads.m2g_um_seg1 to genome genome with Bowti
e2 (1/4)
[2016-05-17 20:56:47] Mapping left_kept_reads.m2g_um_seg2 to genome genome with Bowti
e2 (2/4)
[2016-05-17 21:09:23] Mapping left_kept_reads.m2g_um_seg3 to genome genome with Bowti
e2 (3/4)
[2016-05-17 21:24:00] Mapping left_kept_reads.m2g_um_seg4 to genome genome with Bowti
e2 (4/4)
[2016-05-17 21:46:41] Retrieving sequences for splices
[2016-05-17 21:48:18] Indexing splices
Building a SMALL index
[2016-05-17 21:48:18] Mapping left_kept_reads.m2g_um_seg1 to genome segment_juncs wit
h Bowtie2 (1/4)
[2016-05-17 21:49:01] Mapping left_kept_reads.m2g_um_seg2 to genome segment_juncs wit
h Bowtie2 (2/4)
[2016-05-17 21:49:42] Mapping left_kept_reads.m2g_um_seg3 to genome segment_juncs wit
h Bowtie2 (3/4)
[2016-05-17 21:50:24] Mapping left_kept_reads.m2g_um_seg4 to genome segment_juncs wit
h Bowtie2 (4/4)
[2016-05-17 21:51:04] Joining segment hits
    [FAILED]
Error running 'long_spanning_reads':Loading insertions...done
ADD REPLY
0
Entering edit mode

I see it in src/SeqAn-1.4.2/seqan/store/store_io_gff.h and src/tophat-fusion-post (although I think the second isn't relevant in this case). In the header file it looks like it is trying to match some type of key, which would make sense that it is looking in the info column of the GTF. I have to stress I'm not familiar with the TopHat code so it's possible that it is unrelated; I just did a quick search of the codebase.

It seems to me that it is generally accepting the GTF file (if I remember correctly it will fail earlier if it is bad). TopHat has fixed a couple of bugs with GTF/GFF files over the last several versions. Are you able to try an earlier version, maybe 2.1.0 or so http://ccb.jhu.edu/software/tophat/downloads/? Its possible that you have some type of edge case/bug that was introduced in this version. Otherwise I don't see any problems (I'm assume the columns are tab separated and the formatting here just doesn't show that).

Also, have you tried running it without any CDS changes? I'm not sure how you're changing transcript to CDS, but I've run into the problem with awk in the past that in the middle of a file it screwed up a tab and ruined the format (I think this can happen is the replacement makes the end of the column exactly the character before a new column would start). I would expect that tophat would fail earlier if this was the case but I suppose anything is possible.

ADD REPLY

Login before adding your answer.

Traffic: 1982 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6