I want to use RNA-seq data to quantify gene expression, with no intention of finding new transcripts. It would save time to align the reads directly to a transcriptome index built from RefSeq RNA, instead of aligning them to the genome and then looking up annotations. I align the reads with bowtie and then quantify them with cufflinks (cufflinks has no way of differentiating transcriptome alignments from genome alignments). However, I am not sure whether cufflinks can calculate FPKM correctly from such alignments.
How do I incorporate gene length information? Would I have to make a GTF file for that?
I did read the methodology of cufflinks given in the supplementary information of the paper, but I am not exactly sure how it approximates the values. Moreover, the statistical model it uses treats the transcriptome as a subset of the genome, and the equations are written accordingly. I think this question must have been asked numerous times, but how exactly does cufflinks calculate FPKM?
Would it be better, in this case, to write my own script to calculate FPKM (counting one read pair as one fragment)?
Should fragment lengths be normalized?
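For reference, here is a minimal sketch of what such a script could look like, using the plain count-based definition of FPKM (fragments per kilobase of transcript per million mapped fragments). It is not what cufflinks does internally: as far as I understand, cufflinks estimates abundances with a likelihood model and length corrections, so its numbers will generally differ. The function name `fpkm`, the transcript IDs, and the counts below are hypothetical and only for illustration.

```python
def fpkm(fragment_counts, transcript_lengths):
    """Return naive count-based FPKM per transcript.

    fragment_counts: dict of transcript id -> mapped fragments
                     (one read pair counts as one fragment)
    transcript_lengths: dict of transcript id -> length in bases
    """
    total_fragments = sum(fragment_counts.values())
    if total_fragments == 0:
        raise ValueError("no mapped fragments")
    values = {}
    for transcript_id, count in fragment_counts.items():
        length = transcript_lengths[transcript_id]
        # count / (length / 1e3) / (total / 1e6) == count * 1e9 / (length * total)
        values[transcript_id] = count * 1e9 / (length * total_fragments)
    return values


# Hypothetical transcript ids, counts, and lengths
counts = {"NM_A": 500, "NM_B": 1500}
lengths = {"NM_A": 1000, "NM_B": 3000}
print(fpkm(counts, lengths))
```

With a transcriptome-only alignment, the lengths can be taken directly from the sequences in the RefSeq RNA FASTA file, so no GTF is needed for this naive version. Multi-mapped reads and isoforms of the same gene are not handled here.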
I think the speed benefit will be minimal (if that is your only motivation). It is easier to run the pipeline with -G, which avoids assembly, and then continue with the DE analyses.
Won't searching through the entire genome using annotations (GTF) take more time than searching just a few annotated RNAs (~30000)?
Does the -T/--transcriptome-only option in tophat require a GTF annotation?
This will reduce the search space too, good find! (If you mean annotation as in functions, I don't think TopHat cares.) In general, and again this is my opinion, the indexing/search problem is largely solved. Once the index is built, lookup is constant time.