I have a database of 10 distinct biological samples. Each of these samples was sequenced (RNA-Seq) using paired-end reads and 6 barcodes. Thus, for each of the ten biological samples, I have 12 .fastq files, with names like ATACTC_1, ATACTC_2, GTGCTC_1, GTGCTC_2...and so on.
I would like to follow this pipeline:
1. Analyze data quality with FastQC
2. Trim data with Trim Galore!
3. Align with TopHat2
4. ??? generate counts, differential expression analysis, etc.
Here is my question: at what point in this pipeline can I (should I) combine all of the .fastq data together? If each sample has 12 files associated with it, at what point do I collapse the 12 files into a single file representing the RNA-Seq data for a single biological sample that I can analyze for counts in step #4?
I'm guessing I combine all of the .fastq files up front (re-multiplex?) with cat file1...file12. I could also do steps #1-2 or steps #1-3 completely, and then combine the output of step #3.
Thank you for any help you can provide! This board has already been tremendously helpful to me.
Ah, that's so helpful! Thank you very much.
Actually, a question for clarification: I suspect that the 6 barcoded runs for each sample are not technical replicates, but that each is actually 1/6 of the volume of the total sample library (so the libraries are not the same, although they all come from the same sample). In that case, do I need to analyze each file separately and then treat 'sample' as a cofactor in any differential expression analysis, or is there still a way to combine the data?
EDIT: Oops - it occurs to me that this still could fit the definition of a technical replicate, in which case it would be fine to combine just as you suggested.
That's a good question about the separate library replicates. To be honest, it's a somewhat unusual design (to me), so I'm not positive what the best thing to do is. If I wanted to be completely thorough, I'd first do an analysis where I kept the 6 technical replicates separate and performed a 6 vs. 6 (and then 6 vs. 6 vs. 6 vs. ...) comparison. I'd also do some checks to make sure the 6 library replicates for each sample are consistently similar to one another.
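As a rough sketch of one such similarity check, assuming you already have per-gene counts for each barcoded run of a sample: compute pairwise Pearson correlations of log2(count + 1) between the runs and look for any pair that stands out as low. The function names, the barcode CCGTAA, and all of the counts below are made up for illustration, and this is only one of several reasonable checks.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def replicate_correlations(counts):
    """counts maps barcode -> list of per-gene counts (same gene order in every list).

    Returns {(barcode_a, barcode_b): Pearson correlation of log2(count + 1)}.
    """
    logged = {bc: [math.log2(c + 1) for c in values] for bc, values in counts.items()}
    return {
        (a, b): pearson(logged[a], logged[b])
        for a, b in combinations(sorted(logged), 2)
    }


# Made-up counts for three barcoded runs of one biological sample.
toy_counts = {
    "ATACTC": [10, 200, 0, 55],
    "GTGCTC": [12, 180, 1, 60],
    "CCGTAA": [9, 210, 0, 50],  # hypothetical barcode, not from the original post
}

correlations = replicate_correlations(toy_counts)
lowest_pair = min(correlations, key=correlations.get)
print("least similar pair:", lowest_pair, round(correlations[lowest_pair], 3))
```

If one run is consistently the least similar to the others across samples, that would be a reason to look more closely before pooling it.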
If the sequencing coverage is similar across the six libraries and there isn't batch-to-batch variation in the library preparation, my intuition says that just adding counts should be fine. You could start to justify that by noting that a sum of independent negative binomial variables is itself negative binomial under certain conditions (for example, when they share the same success-probability parameter), and then checking that your experimental setup plausibly satisfies those conditions.
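Here is a minimal sketch of what "adding counts" could look like once each barcoded run has its own per-gene count table. The run names, gene names, and counts are invented, and the function name is just for illustration; the only real assumption is that you can map each run back to its biological sample.

```python
from collections import defaultdict


def collapse_technical_replicates(counts_by_run, run_to_sample):
    """Sum per-gene counts over all runs that belong to the same sample.

    counts_by_run: {run_name: {gene: count}}
    run_to_sample: {run_name: sample_name}
    Returns {sample_name: {gene: summed_count}}.
    """
    collapsed = defaultdict(lambda: defaultdict(int))
    for run, gene_counts in counts_by_run.items():
        sample = run_to_sample[run]
        for gene, count in gene_counts.items():
            collapsed[sample][gene] += count
    # Convert the nested defaultdicts to plain dicts for easier downstream use.
    return {sample: dict(genes) for sample, genes in collapsed.items()}


# Invented example: two runs from sample1, one run from sample2.
counts_by_run = {
    "sample1_ATACTC": {"geneA": 10, "geneB": 0},
    "sample1_GTGCTC": {"geneA": 12, "geneB": 3},
    "sample2_ATACTC": {"geneA": 7, "geneB": 5},
}
run_to_sample = {
    "sample1_ATACTC": "sample1",
    "sample1_GTGCTC": "sample1",
    "sample2_ATACTC": "sample2",
}

collapsed = collapse_technical_replicates(counts_by_run, run_to_sample)
print(collapsed)
```

The summed table then has one column of raw counts per biological sample, which is the shape that count-based differential expression tools generally expect.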