Supply TopHat with a set of gene model annotations and/or known transcripts, as a GTF 2.2 or GFF3 formatted file. If this option is provided, TopHat will first extract the transcript sequences and use Bowtie to align reads to this virtual transcriptome first.
This sounds great.
According to the GTF 2.2 spec (http://mblab.wustl.edu/GTF22.html), a GTF file can use exon, 3UTR, or 5UTR features to represent exons. The spec also describes start_codon and CDS features. In addition, each line carries gene_id and transcript_id name-value pairs in the attributes field.
I don't think TopHat cares about translation, so I'm guessing it will work just fine if I give it a GTF with exon features only. It probably doesn't need the "gene_id" attribute either.
Does anyone know the minimal data TopHat needs to align reads onto a virtual transcriptome?
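One way to test this guess would be to strip an existing annotation down to its exon rows and see whether TopHat still accepts it. Here is a minimal sketch in Python, assuming a tab-delimited GTF with the usual nine columns. The function name exon_only is made up for illustration and is not part of TopHat.

```python
def exon_only(gtf_lines):
    """Yield only the exon rows of a GTF, skipping comments and other features."""
    for line in gtf_lines:
        # Skip comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 3 (index 2) holds the feature type in GTF.
        if len(fields) == 9 and fields[2] == "exon":
            yield line.rstrip("\n")
```

Writing the yielded rows to a new file gives an exon-only annotation to try. Whether TopHat is happy with it is exactly the open question here, so treat this as an experiment rather than an answer.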
Maybe you've figured this out already, but I did something similar and it seems to have worked. I made a GTF file in which each feature is a 600 bp region of the Arabidopsis chloroplast genome. I named each feature so that I can recover its location later in the pipeline.
I labeled every feature "protein_coding" and "exon", but I don't know whether that matters.
So I would say the minimal information TopHat needs is a genome or chromosome and, if supplied, a GTF file with valid coordinates.
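For anyone who wants to try the same thing, here is a rough Python sketch of how such a file could be generated. It is not the exact script I used. The function name window_gtf, the sequence name "ChrC", and the length 1500 are placeholders for illustration, so substitute the name and length from your own FASTA. It also assumes every window sits on the + strand and gets gene_id and transcript_id set to the window name.

```python
def window_gtf(seqname, length, window=600, source="protein_coding"):
    """Return GTF lines tiling a sequence with fixed-size 'exon' features."""
    lines = []
    # GTF coordinates are 1-based and inclusive on both ends.
    for start in range(1, length + 1, window):
        end = min(start + window - 1, length)  # last window may be shorter
        # Encode the location in the name so it can be recovered downstream.
        name = f"{seqname}_{start}_{end}"
        attrs = f'gene_id "{name}"; transcript_id "{name}";'
        lines.append("\t".join(
            [seqname, source, "exon", str(start), str(end), ".", "+", ".", attrs]
        ))
    return lines


if __name__ == "__main__":
    # Placeholder values: replace with your real sequence name and length.
    for row in window_gtf("ChrC", 1500):
        print(row)
```

The sequence name in the first column has to match the name in the Bowtie index, otherwise the features will not line up with the reference.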
By the way, I know you! I'm Ben, a student in the UT-Knoxville GST program, and I came to your workshop on metabolomics and RNA-seq a couple of years ago.
Follow-up: Is the source code hosted publicly, or should I just get it from the tarball on the TopHat site?
Is this a really dumb question?