Question

Is combining fastq files before running tophat the same as combining bam file output from tophat?

10.1 years ago

colin.kern ★ 1.1k

If we have two RNA-Seq libraries and run tophat on each of them, then combine the resulting bam files and run cufflinks on that, will that produce the exact same result as combining the fastq files before running tophat? I know that it wouldn't be the same to combine the results after cufflinks, since a transcript may not be able to be built from reads in a single library, but combining reads from different libraries would allow it to be assembled. I'm wondering if there is something similar with tophat.

RNA-Seq tophat • 3.8k views

ADD COMMENT • link updated 2.7 years ago by Ram 45k • written 10.1 years ago by colin.kern ★ 1.1k

Ram · Answer 1 · 2015-06-26

10.1 years ago

Sam ★ 4.8k

It depends on what you mean by combining.

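If, for example, you are combining different lanes or runs of the same sample, then the result should be essentially the same whether you merge the BAM files or the FASTQ files. However, you need to make sure the read group settings identify them as the same sample. One caveat: TopHat discovers splice junctions from the reads it is given, so junctions with very low coverage could differ slightly between the two approaches.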
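To make the two routes concrete, here is a minimal sketch that only builds the command lines (it does not run them). The file names, index name and sample ID are hypothetical, and the flags are the ones I remember from the samtools and TopHat manuals (`samtools merge`, comma-separated read files, `--rg-id`, `--rg-sample`), so check them against your installed versions.

```python
from typing import List


def samtools_merge_cmd(out_bam: str, lane_bams: List[str]) -> List[str]:
    """Route A: align each lane separately, then merge the BAM files."""
    # General form: samtools merge <out.bam> <in1.bam> <in2.bam> ...
    return ["samtools", "merge", out_bam, *lane_bams]


def tophat_pooled_cmd(index: str, lane_fastqs: List[str],
                      out_dir: str, sample: str) -> List[str]:
    """Route B: hand TopHat all lanes at once as a comma-separated list."""
    return [
        "tophat",
        "-o", out_dir,
        "--rg-id", sample,      # read group ID written to the BAM header
        "--rg-sample", sample,  # marks every lane as the same sample
        index,
        ",".join(lane_fastqs),  # single-end reads; lanes joined by commas
    ]


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    print(" ".join(samtools_merge_cmd("sample1.bam",
                                      ["lane1.bam", "lane2.bam"])))
    print(" ".join(tophat_pooled_cmd("genome_index",
                                     ["lane1.fq", "lane2.fq"],
                                     "sample1_tophat", "sample1")))
```

Either list can be passed to `subprocess.run` once you are happy with it.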

However, if you are trying to combine different samples, then you should not combine them before running TopHat unless your main goal is the detection of novel transcripts. The problem is that if you merge the FASTQ files before alignment, you lose the information about origin, e.g. you no longer know whether read A came from sample 1 or sample 2.

Now, if you are trying to detect novel transcripts, Trinity does suggest merging the FASTQ files before de novo assembly for the reason you've mentioned: a novel transcript might be only partially captured in a single library. So, in my experience (which was 2 years ago, so things might have changed), to detect novel transcripts you merge the FASTQ files and construct the transcripts from the pooled reads, then align the reads of each individual sample back to the assembled transcripts to get per-sample alignment information.
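As a rough sketch of that pooled-assembly workflow, the function below lays out the commands in order: one Trinity run on all samples together, then one alignment per sample against the assembly. Everything here is an assumption for illustration: the sample names and paths are hypothetical, Bowtie 2 is just one possible aligner for the second step, the memory setting is arbitrary, and the flags are as I recall them from the Trinity and Bowtie 2 documentation. The assembly FASTA path is a parameter because Trinity's output file name varies between versions.

```python
from typing import Dict, List, Tuple


def pooled_assembly_plan(samples: Dict[str, Tuple[str, str]],
                         assembly_fasta: str,
                         index_base: str = "assembly_index") -> List[List[str]]:
    """Return commands: pooled assembly first, then per-sample alignment."""
    names = sorted(samples)
    # Pool the paired-end reads of every sample for the assembly step.
    left = ",".join(samples[n][0] for n in names)
    right = ",".join(samples[n][1] for n in names)
    plan = [
        ["Trinity", "--seqType", "fq", "--left", left, "--right", right,
         "--max_memory", "20G", "--output", "trinity_out"],
        # Index the assembled transcripts so reads can be aligned to them.
        ["bowtie2-build", assembly_fasta, index_base],
    ]
    # Align each sample on its own so the origin of every read is kept.
    for n in names:
        r1, r2 = samples[n]
        plan.append(["bowtie2", "-x", index_base,
                     "-1", r1, "-2", r2, "-S", f"{n}.sam"])
    return plan


if __name__ == "__main__":
    # Hypothetical samples, for illustration only.
    demo = {"s1": ("s1_1.fq", "s1_2.fq"), "s2": ("s2_1.fq", "s2_2.fq")}
    for cmd in pooled_assembly_plan(demo, "Trinity.fasta"):
        print(" ".join(cmd))
```

The point of the layout is that only the assembly sees pooled reads; the quantification step still works on one sample at a time.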

ADD COMMENT • link updated 2.7 years ago by Ram 45k • written 10.1 years ago by Sam ★ 4.8k

We have 8 tissue types with 2 replicates each, so a total of 16 samples. We've run TopHat/Cufflinks on these samples and are getting ~30 million reads aligned and expression of ~15,000 annotated genes. What we're trying to determine now is whether it would be worth getting more reads from these samples, so our idea is to combine the reads of the same tissue types, collapsing the data down to 8 "samples", and then redo the analysis to see if that increases the number of expressed genes we detect. Since TopHat takes a while to run, I was wondering if I could use the BAM files I've already generated and just combine them, or whether the merging should be done before alignment. So we are not concerned with losing the information about origin, since we're essentially combining reads from two replicates to create a virtual single sample.

ADD REPLY • link 10.1 years ago by colin.kern ★ 1.1k

My recommendation would be something simpler. When running Cufflinks (more precisely its cuffdiff tool), you can state the condition of each sample, e.g. case / control. Instead of labelling each tissue + replicate combination individually, you can simply give all the samples from the same tissue or the same replicates the same label. The reason is that if you combine the samples into one data set, the statistical analysis loses power because you have fewer samples. By giving the samples the same label instead, the statistical tools can account for the variation between samples and therefore give better estimates.
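The labelling idea above might look like this with cuffdiff, the differential expression tool in the Cufflinks suite. This is only a sketch that builds the command line: the tissue names, BAM paths and GTF file are hypothetical, and it assumes the cuffdiff convention that replicates of one condition are given as a comma-separated list and that `-L` takes the matching comma-separated labels.

```python
from typing import Dict, List


def cuffdiff_cmd(gtf: str, bams_by_tissue: Dict[str, List[str]],
                 out_dir: str = "cuffdiff_out") -> List[str]:
    """Build a cuffdiff call with one label per tissue.

    Replicates stay as separate BAM files; they are only grouped
    under a shared label, so between-replicate variation is kept.
    """
    labels = sorted(bams_by_tissue)
    cmd = ["cuffdiff", "-o", out_dir, "-L", ",".join(labels), gtf]
    for label in labels:
        # One argument per condition, replicates joined by commas.
        cmd.append(",".join(bams_by_tissue[label]))
    return cmd


if __name__ == "__main__":
    # Hypothetical replicate BAMs, for illustration only.
    demo = {
        "liver": ["liver_rep1.bam", "liver_rep2.bam"],
        "brain": ["brain_rep1.bam", "brain_rep2.bam"],
    }
    print(" ".join(cuffdiff_cmd("genes.gtf", demo)))
```

With 8 tissues and 2 replicates each, the dictionary would simply have 8 keys with 2 BAM files per key, and no merging of BAM or FASTQ files is needed.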

As mentioned before, unless you want to detect novel transcripts or transcripts with extremely low expression values, you shouldn't need to worry too much about the sequencing depth.

ADD REPLY • link updated 2.7 years ago by Ram 45k • written 10.1 years ago by Sam ★ 4.8k