If we have two RNA-Seq libraries, run TopHat on each of them, then combine the resulting BAM files and run Cufflinks on the combined file, will that produce exactly the same result as combining the fastq files before running TopHat? I know it would not be the same to combine the results after Cufflinks: a transcript might not be buildable from the reads in a single library, whereas combining reads from different libraries would allow it to be assembled. I'm wondering whether something similar applies to TopHat.
We have 8 tissue types with 2 replicates each, so a total of 16 samples. We've run TopHat/Cufflinks on these samples and are getting ~30 million reads aligned and expression of ~15,000 annotated genes. What we're trying to determine now is whether it would be worth getting more reads from these samples. Our idea is to combine the reads from the same tissue type, collapsing the data down to 8 "samples", and then redo the analysis to see whether that increases the number of expressed genes we detect. Since TopHat takes a while to run, I was wondering whether I could just combine the BAM files I've already generated, or whether the combining should be done before alignment. We are not concerned about losing the information on which replicate each read came from, since we're essentially combining reads from two replicates to create a single virtual sample.
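For the BAM-combining route described above, here is a minimal sketch of how the per-tissue merges could be set up. It only builds `samtools merge` command lines and does not run them. The directory layout (`<tissue>_rep<N>/accepted_hits.bam`), the tissue names, and the helper `merge_commands` are assumptions made for illustration, not anything from the actual project. It also does not address whether the merged BAM is identical to aligning the combined fastq files.

```python
from collections import defaultdict


def merge_commands(bams_by_sample, out_dir="merged"):
    """Build one `samtools merge` command per tissue.

    bams_by_sample maps (tissue, replicate) -> path to that sample's BAM.
    Returns a dict: tissue -> argument list for the merge command.
    """
    by_tissue = defaultdict(list)
    # Sort so replicates are always listed in a stable order.
    for (tissue, _replicate), bam in sorted(bams_by_sample.items()):
        by_tissue[tissue].append(bam)
    return {
        tissue: ["samtools", "merge", f"{out_dir}/{tissue}.bam", *bams]
        for tissue, bams in by_tissue.items()
    }


# Hypothetical layout: one TopHat output directory per tissue and replicate.
example = {
    ("liver", 1): "liver_rep1/accepted_hits.bam",
    ("liver", 2): "liver_rep2/accepted_hits.bam",
    ("brain", 1): "brain_rep1/accepted_hits.bam",
    ("brain", 2): "brain_rep2/accepted_hits.bam",
}

for tissue, cmd in merge_commands(example).items():
    # To actually run a command: subprocess.run(cmd, check=True)
    print(tissue, " ".join(cmd))
```

Each resulting merged BAM could then be passed to Cufflinks as one of the 8 virtual samples.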
My recommendation would be something simpler. When running the Cufflinks suite (specifically cuffdiff), you can state the condition of each sample, e.g. Case / Control. Instead of labelling every tissue + replicate combination individually, you can simply give all the samples from the same tissue (i.e. its replicates) the same label. The reason is that if you combine the samples into a single data set, the statistical analysis loses power because you have fewer samples. By giving the samples the same label instead, the statistical tools can account for the variation between samples and therefore give better estimates.
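As a rough illustration of the labelling approach: cuffdiff (the differential-expression tool in the Cufflinks suite) takes condition labels through `-L` and takes the replicates of one condition as a comma-separated list of BAM files. The sketch below only assembles such a command line. The file names, tissue labels, annotation file `genes.gtf`, and the helper `cuffdiff_command` are assumptions for illustration.

```python
def cuffdiff_command(gtf, replicates_by_label, out_dir="cuffdiff_out"):
    """Build a cuffdiff command where replicates share one condition label.

    replicates_by_label maps a label (e.g. a tissue name) to the list of
    BAM files that are replicates of that condition.
    """
    labels = list(replicates_by_label)
    # Replicates of one condition are joined with commas;
    # different conditions are passed as separate arguments.
    groups = [",".join(replicates_by_label[label]) for label in labels]
    return ["cuffdiff", "-o", out_dir, "-L", ",".join(labels), gtf, *groups]


# Hypothetical inputs: two tissues, two replicates each.
cmd = cuffdiff_command(
    "genes.gtf",
    {
        "liver": ["liver_rep1/accepted_hits.bam", "liver_rep2/accepted_hits.bam"],
        "brain": ["brain_rep1/accepted_hits.bam", "brain_rep2/accepted_hits.bam"],
    },
)
# To actually run it: subprocess.run(cmd, check=True)
print(" ".join(cmd))
```

This keeps the replicates as separate inputs, so the between-replicate variation remains available to the statistical test, which is the point of labelling rather than merging.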
As mentioned before, unless you want to detect novel transcripts or transcripts with extremely low expression values, you wouldn't need to worry too much about the read length.