I'm trying to figure out why Tophat is failing to align a custom gtf file. I'm new to scripting, any suggestions appreciated:
My file is a subset of annotations from the gencode.v19.long_noncoding_RNAs.gtf from Gencode. A typical line looks like this:
chr1 HAVANA gene 29554 31109 . + . gene_id "ENSG00000243485.2"; transcript_id "ENSG00000243485.2"; gen
e_type "lincRNA"; gene_status "NOVEL"; gene_name "MIR1302-11"; transcript_type "lincRNA"; transcript_status "NOVEL"; transcript_nam
e "MIR1302-11"; level 2; havana_gene "OTTHUMG00000000959.2"
My amended file looks like this:
chr1 HAVANA gene 28832492 28837404 . + . gene_id "ENSG00000242125.2"; transcript_id "ENSG000
00242125.2"; gene_type "sense_intronic"; gene_status "KNOWN"; gene_name "SNHG3"; transcript_type "sense_intronic"; transcript_statu
s "KNOWN"; transcript_name "SNHG3"; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000003657.1"
I parsed the gencode.gtf file with a python script, to only take lines with specific gene names, such as SNHG3 above. I also converted any instances of 'transcript' in column 3 to CDS in the custom file, would this be causing the problem? The original file still contains all instances of 'transcript' in column 3.
However, when I run tophat with both the full and subset gtf file, there is an error for the subset file as shown in the tophat output log:
[2016-05-04 11:10:44] Beginning TopHat run (v2.1.1)
-----------------------------------------------
[2016-05-04 11:10:44] Checking for Bowtie
Bowtie version: 2.2.8.0
[2016-05-04 11:10:44] Checking for Bowtie index files (genome)..
[2016-05-04 11:10:44] Checking for reference FASTA file
[2016-05-04 11:10:44] Generating SAM header for GRCh37
[2016-05-04 11:10:46] Reading known junctions from GTF file
[2016-05-04 11:10:46] Preparing reads
left reads: min. length=101, max. length=101, 38986365 kept reads (133725 discarded)
[2016-05-04 11:24:46] Building transcriptome data files tophat_amendedLncRNAs_HIV/tmp/gtfConverted
[2016-05-04 11:24:51] Building Bowtie index from gtfConverted.fa
[2016-05-04 11:27:31] Mapping left_kept_reads to transcriptome gtfConverted with Bowtie2
[2016-05-04 16:03:50] Resuming TopHat pipeline with unmapped reads
[2016-05-04 16:03:50] Mapping left_kept_reads.m2g_um to genome GRCh37 with Bowtie2
[2016-05-04 22:25:27] Mapping left_kept_reads.m2g_um_seg1 to genome GRCh37 with Bowtie2 (1/4)
[2016-05-04 22:47:22] Mapping left_kept_reads.m2g_um_seg2 to genome GRCh37 with Bowtie2 (2/4)
[2016-05-04 23:07:50] Mapping left_kept_reads.m2g_um_seg3 to genome GRCh37 with Bowtie2 (3/4)
[2016-05-04 23:32:11] Mapping left_kept_reads.m2g_um_seg4 to genome GRCh37 with Bowtie2 (4/4)
[2016-05-05 00:10:42] Retrieving sequences for splices
[2016-05-05 00:14:12] Indexing splices
Building a SMALL index
[2016-05-05 00:14:23] Mapping left_kept_reads.m2g_um_seg1 to genome segment_juncs with Bowtie2 (1/4)
[2016-05-05 00:15:28] Mapping left_kept_reads.m2g_um_seg2 to genome segment_juncs with Bowtie2 (2/4)
[2016-05-05 00:16:40] Mapping left_kept_reads.m2g_um_seg3 to genome segment_juncs with Bowtie2 (3/4)
[2016-05-05 00:17:40] Mapping left_kept_reads.m2g_um_seg4 to genome segment_juncs with Bowtie2 (4/4)
[2016-05-05 00:18:42] Joining segment hits
[FAILED]
Error running 'long_spanning_reads':Loading insertions...done
Does anyone have any idea what may be wrong with the subset gtf file? Is it the conversion of 'transcript' to CDS?
Try Tophat previous version 2.1.0. Since 2.1.1 is a maintenance version, there might be some problem with that.
Forget Tophat, use STAR
Not very helpful. At least motivate your statement.
A fail on Tophat 2.1.0. I'll look into Star next...
If STAR crash, post a new question and I will try to help you.
Did you only select lines with
gene
annotations and leave the exon's behind? Can you post the tophat command you are running?Thanks for the replies. Here's the tophat command I ran for the failed alignment: tophat -p 8 --no-novel-juncs -G gtfConverted.gtf -o tophat_amendedLncRNAs_HIV GRCh37 HIVpos.fastq.gz.
I selected all lines, including genes and exons, assuming that the 'exon' lines also contained the same 'gene name'. The python script was written to parse out all lines that contained a specific 'gene name', irrespective of column three. Then I converted all instances of 'transcript' in column 3 to 'CDS'. This was what I was advised, as Tophat doesn't recognize 'transcript' ?
I'll try running it with Tophat version 2.1.0 in the meantime...
I think you should keep the transcript intact, and rather change exon by CDS.