Question

I have demultiplexed files for a single biological replicate. When should I combine them in the pipeline?

9.9 years ago

Kristin Muench ▴ 640

I have a dataset of 10 distinct biological samples. Each sample was sequenced (RNA-Seq) with paired-end reads across 6 barcodes, so for each of the ten samples I have 12 .fastq files, with names like ATACTC_1, ATACTC_2, GTGCTC_1, GTGCTC_2, and so on.

I would like to follow this pipeline:

1. Analyze data quality with FastQC
2. Trim data with Trim Galore!
3. Align with TopHat2
4. ??? generate counts, differential expression analysis, etc.

Here is my question: at what point in this pipeline can I (should I) combine all of the .fastq data together? If each sample has 12 files associated with it, at what point do I collapse the 12 files into a single file representing the RNA-Seq data for a single biological sample that I can analyze for counts in step #4?

I'm guessing I combine all of the .fastq files up front (re-multiplex?) with cat file1...file12. I could also do steps #1-2 or steps #1-3 completely, and then combine the output of step #3.

Thank you for any help you can provide! This board has already been tremendously helpful to me.

FastQC RNA-Seq • 4.3k views

Updated 2.7 years ago by Ram 44k • written 9.9 years ago by Kristin Muench ▴ 640

matted · Accepted Answer · 2015-01-13

9.9 years ago

matted 7.8k

I think it's most typical to combine raw reads if they correspond to the same library and original sample (so merging technical replicates, with respect to the sequencer, and not biological replicates). This paper was one of the early ones to test and validate this assumption.

So for your outlined process, that could be anywhere in steps 1 to 3. Personally, I would make count tables for all the 6*10 runs separately and then combine the count tables at the very end, before any clustering or differential analysis (so in the middle of your step 4). This is because all the earlier steps can be performed in parallel, and you might save some time by processing many chunks at once.
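As a rough sketch of what "combine the count tables at the very end" could look like: assuming each barcoded run has already produced a gene-to-count mapping, the per-sample table is just the gene-by-gene sum. The gene names, counts, and the `sum_count_tables` helper are made up for illustration and are not part of any counting tool.

```python
def sum_count_tables(tables):
    """Add per-run gene counts into a single table for one sample."""
    total = {}
    for table in tables:
        for gene, count in table.items():
            # A gene missing from a run is treated as a count of zero.
            total[gene] = total.get(gene, 0) + count
    return total


# Hypothetical counts from two of the barcoded runs of one sample.
run_a = {"geneA": 10, "geneB": 0, "geneC": 3}
run_b = {"geneA": 7, "geneB": 2}

combined = sum_count_tables([run_a, run_b])
print(combined)
```

Because every run is counted independently before this step, the runs can all be processed in parallel and only this cheap summation happens once per sample.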

And just for completeness, you'll need to pair up the two matching paired-end fastq files (e.g. X_1 and X_2) and give them to the aligner together, rather than aligning each read end on its own. You might need to analyze them together for the adapter trimming as well, or possibly trim adapters for each read end separately.
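One way to keep the mates matched up is to group the file names by barcode before running anything. This is only a sketch: it assumes names of the form `<barcode>_<1|2>.fastq` (the `.fastq` extension and the `pair_fastq_files` helper are my assumptions, based on the names in the question).

```python
import re
from collections import defaultdict


def pair_fastq_files(filenames):
    """Group names like ATACTC_1.fastq / ATACTC_2.fastq into (read1, read2) pairs."""
    pattern = re.compile(r"^(?P<barcode>[ACGTN]+)_(?P<end>[12])\.fastq$")
    by_barcode = defaultdict(dict)
    for name in filenames:
        match = pattern.match(name)
        if match is None:
            raise ValueError(f"unexpected file name: {name}")
        by_barcode[match["barcode"]][match["end"]] = name

    pairs = []
    for barcode in sorted(by_barcode):
        ends = by_barcode[barcode]
        if set(ends) != {"1", "2"}:
            raise ValueError(f"barcode {barcode} is missing a read end")
        pairs.append((ends["1"], ends["2"]))
    return pairs


files = ["GTGCTC_2.fastq", "ATACTC_1.fastq", "GTGCTC_1.fastq", "ATACTC_2.fastq"]
for read1, read2 in pair_fastq_files(files):
    print(read1, read2)
```

If you do decide to concatenate up front, joining all the read-1 files and all the read-2 files in this same barcode order keeps each read at the same position as its mate.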

Updated 2.7 years ago by Ram 44k • written 9.9 years ago by matted 7.8k

Ah, that's so helpful! Thank you very much.

Updated 2.7 years ago by Ram 44k • written 9.9 years ago by Kristin Muench ▴ 640

Actually, a question for clarification: I suspect that the 6 barcoded runs for each sample are not technical replicates, but each actually 1/6 of the volume of the total sample library (so the libraries are not the same, although they all come from the same sample). In that case, do I need to analyze each file separately and then treat 'sample' as a cofactor in any differential expression analysis, or is there still a way to combine the data?

EDIT: Oops - it occurs to me that this still could fit the definition of a technical replicate, in which case it would be fine to combine just as you suggested.

Updated 2.7 years ago by Ram 44k • written 9.9 years ago by Kristin Muench ▴ 640

That's a good question about the separate library replicates. To be honest, it's a somewhat unusual design (to me), so I'm not positive what the best thing to do is. If I wanted to be completely thorough, I'd first do an analysis where I kept the 6 technical replicates separate and performed a 6 vs. 6 (and then 6 vs. 6 vs. 6 vs. ...) analysis. I'd also do some checks to make sure the 6 library replicates are always similar.
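For the similarity check, one simple option (a sketch, not the only reasonable test) is the pairwise Pearson correlation of log-transformed counts between the runs of a sample. The counts below are invented, `CCGTAA` is a made-up barcode, and `replicate_correlations` is a hypothetical helper.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def replicate_correlations(counts_by_run):
    """Correlate log2(count + 1) for every pair of runs (same gene order in each)."""
    logged = {
        run: [math.log2(c + 1) for c in counts]
        for run, counts in counts_by_run.items()
    }
    return {
        (a, b): pearson(logged[a], logged[b])
        for a, b in combinations(sorted(logged), 2)
    }


# Invented counts for the same four genes in three runs of one sample.
counts_by_run = {
    "ATACTC": [100, 5, 40, 0],
    "GTGCTC": [110, 4, 35, 1],
    "CCGTAA": [95, 6, 42, 0],
}
for pair, r in replicate_correlations(counts_by_run).items():
    print(pair, round(r, 3))
```

A run whose correlations with the others are noticeably lower than the rest would be worth looking at before adding its counts in.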

If the sequencing coverages are similar and there isn't batch-to-batch variation in the library preparation, my intuition says that just adding counts should be fine. You could maybe start to justify that by observing that the sum of negative binomials is still a negative binomial in certain circumstances, and your experimental setup satisfies those assumptions.
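To spell out the "certain circumstances" (this is the standard textbook result, given as a sketch rather than a statement about any particular tool's model): if the counts $X_1, \dots, X_6$ for a gene in the six runs are independent negative binomials that share the same success probability $p$, then their sum is again negative binomial.

```latex
X_i \sim \mathrm{NB}(r_i, p)\ \text{independent}
\quad\Longrightarrow\quad
\sum_{i=1}^{6} X_i \sim \mathrm{NB}\!\left(\sum_{i=1}^{6} r_i,\; p\right)
```

In the mean/size form, where $\mathrm{Var}(X_i) = \mu_i + \mu_i^2 / r_i$ and $p = r_i / (r_i + \mu_i)$, sharing $p$ means the ratio $\mu_i / r_i$ is the same in every run. That is plausible when coverage is similar and there is no batch-to-batch variation, but it is an assumption to check, not a given.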

Updated 2.7 years ago by Ram 44k • written 9.9 years ago by matted 7.8k