Question

tophat GTF 2.2 format question

2


10.6 years ago

Ann ★ 2.4k

I want to run tophat using the -G option:

Supply TopHat with a set of gene model annotations and/or known transcripts, as a GTF 2.2 or GFF3 formatted file. If this option is provided, TopHat will first extract the transcript sequences and use Bowtie to align reads to this virtual transcriptome first.

This sounds great.

According to this GTF 2.2 spec - http://mblab.wustl.edu/GTF22.html - a GTF file can use exon or 3UTR or 5UTR features to represent exons. It also describes start_codon and CDS features. There are also gene_id and transcript_id name-value pairs in the attributes field (the last column).

I don't think tophat cares about translations, so I'm guessing it can work just fine if I give it a GTF with exon features only. Probably it doesn't need the gene_id attribute either.

Does anyone know the minimal data tophat needs to align reads onto a virtual transcriptome?

Would this work?

chr1 BLAH  exon         150   200   .   +   .  transcript_id "X";
chr1 BLAH  exon         300   401   .   +   .  transcript_id "X";
chr1 BLAH  exon         501   650   .   +   .  transcript_id "X";
chr1 BLAH  exon         700   800   .   +   .  transcript_id "X";
chr1 BLAH  exon         900  1000   .   +   .  transcript_id "X";

Also, how would I test this?

Does the tophat code contain unit tests I could use to make sure a given GTF file is correctly read?

RNA-Seq tophat • 3.2k views

ADD COMMENT • link updated 4.3 years ago by Ram 45k • written 10.6 years ago by Ann ★ 2.4k

0


Follow-up: Is the source code hosted publicly or should I just get the source code from the tarball on the tophat site?

ADD REPLY • link updated 4.3 years ago by Ram 45k • written 10.6 years ago by Ann ★ 2.4k

0


Is this a really dumb question?

ADD REPLY • link updated 4.3 years ago by Ram 45k • written 10.6 years ago by Ann ★ 2.4k

Ram · Answer 1 · 2015-02-25

Maybe you've figured this out. But I did something similar and it seems to have worked. I made a GTF file where each feature is a 600 bp region of the Arabidopsis chloroplast genome, once on each strand. I named each feature after its coordinates and strand so I know its location later on down the pipeline.

For every line I set the source column to "protein_coding" and the feature type to "exon", but I don't know if that matters.

Pt      protein_coding  exon    1       600     .       +       .       exon_number 1; gene_id CPt_1.600.pos; gene_name CPt_1.600.pos; seqedit false; transcript_id CPt_1.600.pos.1; transcript_name CPt_1.600.pos; tss_id CPt_1.600.pos
Pt      protein_coding  exon    1       600     .       -       .       exon_number 1; gene_id CPt_1.600.neg; gene_name CPt_1.600.neg; seqedit false; transcript_id CPt_1.600.neg.1; transcript_name CPt_1.600.neg; tss_id CPt_1.600.neg
Pt      protein_coding  exon    601     1200    .       +       .       exon_number 1; gene_id CPt_601.1200.pos; gene_name CPt_601.1200.pos; seqedit false; transcript_id CPt_601.1200.pos.1; transcript_name CPt_601.1200.pos; tss_id CPt_601.1200.pos
Pt      protein_coding  exon    601     1200    .       -       .       exon_number 1; gene_id CPt_601.1200.neg; gene_name CPt_601.1200.neg; seqedit false; transcript_id CPt_601.1200.neg.1; transcript_name CPt_601.1200.neg; tss_id CPt_601.1200.neg
Pt      protein_coding  exon    1201    1800    .       +       .       exon_number 1; gene_id CPt_1201.1800.pos; gene_name CPt_1201.1800.pos; seqedit false; transcript_id CPt_1201.1800.pos.1; transcript_name CPt_1201.1800.pos; tss_id CPt_1201.1800.pos
Pt      protein_coding  exon    1201    1800    .       -       .       exon_number 1; gene_id CPt_1201.1800.neg; gene_name CPt_1201.1800.neg; seqedit false; transcript_id CPt_1201.1800.neg.1; transcript_name CPt_1201.1800.neg; tss_id CPt_1201.1800.neg
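In case it helps, here is a rough Python sketch of how a tiled file like the one above could be generated. It is not the script I actually used. The function name `tile_gtf_lines` and its parameters are made up for illustration, it writes only gene_id and transcript_id (quoted, as the GTF 2.2 spec asks for textual attributes), and you have to supply the sequence length yourself.

```python
def tile_gtf_lines(seqname, length, window=600, source="protein_coding"):
    """Yield one tab-separated GTF exon line per window and strand.

    Coordinates are 1-based and inclusive, as in GTF. The last window
    is shortened if `length` is not a multiple of `window`.
    """
    for start in range(1, length + 1, window):
        end = min(start + window - 1, length)
        for strand, tag in (("+", "pos"), ("-", "neg")):
            # Name each feature after its coordinates and strand,
            # following the CPt_<start>.<end>.<pos|neg> pattern above.
            name = f"CPt_{start}.{end}.{tag}"
            attrs = f'gene_id "{name}"; transcript_id "{name}.1";'
            yield "\t".join(
                [seqname, source, "exon", str(start), str(end),
                 ".", strand, ".", attrs]
            )


if __name__ == "__main__":
    # 1800 is only a toy length covering the six lines shown above.
    for line in tile_gtf_lines("Pt", 1800):
        print(line)
```

Redirect the output to a file and pass that file to -G. The sequence name ("Pt" here) has to match the name used in the reference index.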

So I would say the minimal information TopHat needs is a reference genome (or a single chromosome) and, if you use -G, a GTF file whose coordinates are valid for that reference.
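On the "how would I test this?" part: I don't know of unit tests in TopHat for this, but a quick format check before running is easy to write. The sketch below is only my assumption of what "valid" means (nine tab-separated columns, 1 <= start <= end, a sensible strand, and a transcript_id attribute). The function name `check_gtf_line` is hypothetical, and passing this check says nothing definite about what TopHat itself will accept.

```python
def check_gtf_line(line, required=("transcript_id",)):
    """Return a list of problems found in one GTF line (empty means it looks OK)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return [f"expected 9 tab-separated fields, got {len(fields)}"]

    seqname, source, feature, start, end, score, strand, frame, attrs = fields
    problems = []

    # GTF coordinates are 1-based and inclusive, so start must not exceed end.
    if not (start.isdigit() and end.isdigit()):
        problems.append("start and end must be positive integers")
    elif not 1 <= int(start) <= int(end):
        problems.append("need 1 <= start <= end")

    if strand not in ("+", "-", "."):
        problems.append(f"unexpected strand {strand!r}")

    # Attributes are 'key value;' pairs; collect the keys only.
    keys = {part.split()[0] for part in attrs.split(";") if part.strip()}
    for key in required:
        if key not in keys:
            problems.append(f"missing attribute {key}")

    return problems


if __name__ == "__main__":
    good = 'chr1\tBLAH\texon\t150\t200\t.\t+\t.\ttranscript_id "X";'
    print(check_gtf_line(good))
```

Note that the columns have to be separated by tabs. A line that was pasted with spaces between the columns fails the first check, so it is worth running this on the actual file and not on a copy from a web page. To also insist on gene_id, call it with required=("gene_id", "transcript_id").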

By the way, I know you! I'm Ben, a student in the UT-Knoxville GST program, and I came to your workshop on metabolomics and RNA-seq a couple of years ago.