Hi all,
I have RNA-seq data from a single multifactorial experiment that was run in two batches. I unfortunately had no input into which samples were run at which time, and the groups in each of the batches were completely different (e.g., all biological replicates from groups A-E were run in Batch 1 and all biological replicates from groups F-J were run in Batch 2). There are no samples or groups that were included in both batches (controls were only run with Batch 1).
- The samples cannot be re-run, nor can I repeat the experiment (there is no $ to do so).
I am trying to figure out my options here, since I am in charge of analyzing this suboptimal RNA-seq data. I was thinking of using ComBat-seq to adjust for batch effect without specifying biological covariates (I want to avoid overfitting) and then proceeding with analysis as usual. What are people's thoughts on this approach? I'm not sure what else I can do at this point since my hands are tied in terms of what happened to the samples/data before it got to me.
It would help if you clarified what "run" means here. Exactly which part of the experiment was done in two batches: the full experiment in two sets, just the library prep, or just the sequencing? Was there a common control in the two batches?
Just sequencing, no common control unfortunately.
There should be no appreciable batch effect from sequencing alone, as long as the following are true: the same sequencer (or at least the same type, i.e. 2-color chemistry and flowcell type) was used for both batches, and the yield of reads per sample is similar. You can track which samples came from which flowcell and check using PCA etc., but it would be surprising if there were a batch effect due to sequencing alone.
I assume you are confirming that the actual experiment, the collection of samples, and the preparation of libraries were all done at the same time, by the same person, using a common protocol.
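The PCA check mentioned above could look roughly like the sketch below. It is a minimal illustration on synthetic counts, not a recommended pipeline: the function name `pca_scores`, the flowcell labels, and the log2(CPM + 1) transform are all assumptions made for the example, and real data would normally go through a proper normalization or variance-stabilizing step first. Keep in mind that when batch and group are completely confounded, a separation along a principal component cannot tell you whether it comes from the batch or from the biology.

```python
import numpy as np

def pca_scores(counts, n_components=2):
    """PCA scores for the samples (columns) of a genes x samples count matrix."""
    counts = np.asarray(counts, dtype=float)
    # Scale each sample to counts per million, then log-transform.
    lib_sizes = counts.sum(axis=0)
    log_cpm = np.log2(counts / lib_sizes * 1e6 + 1.0)
    # Centre each gene across samples; transpose so samples are rows.
    centred = (log_cpm - log_cpm.mean(axis=1, keepdims=True)).T
    # SVD of the centred matrix yields the principal components.
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]

# Synthetic stand-in data: 500 genes, 6 samples, two hypothetical flowcells.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=50, size=(500, 6))
flowcell = np.array(["FC1", "FC1", "FC1", "FC2", "FC2", "FC2"])

scores, explained = pca_scores(counts)
for label, (pc1, pc2) in zip(flowcell, scores):
    print(f"{label}\tPC1={pc1:.2f}\tPC2={pc2:.2f}")
```

Plotting PC1 against PC2 and coloring the points by flowcell (or by library prep date) is the usual way to eyeball whether samples cluster by a technical variable.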
The actual experiment and collection of samples were done at the same time. Library prep was definitely done by the same person using the same protocol; the only thing I am not completely certain about is the timing of library prep (pretty sure it was done at the same time, but I have an email out to confirm that now). The same sequencer should have been used; I'm also confirming that.
Well, dangit - I just heard back and the same sequencer was used but library prep was done separately. Bummer. Thanks for the assistance.
Since you are stuck with what you have, keep this as another variable (hopefully it is the same split as the two sequencing batches). Generally people (if from cores/companies) are consistent as long as they are following an SOP for preps.
If the samples were collected at different times then ... hope that you can get something usable.
This is a problem of perfect separation: batch and group are completely confounded, because no group appears in both batches.
If there are highly analogous data in a public repository, you may have some options, but generally you'll never truly know what a given difference is attributable to.
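To make the confounding concrete, here is a small sketch of the design described in the original post (groups A-E in Batch 1, groups F-J in Batch 2). The number of replicates per group is an assumption chosen for illustration. The batch indicator turns out to be an exact sum of group indicator columns, so adding it to the model adds a column but no rank, and a batch effect cannot be estimated separately from the group effects.

```python
import numpy as np

groups = list("ABCDEFGHIJ")
# Layout from the original post: A-E in Batch 1, F-J in Batch 2.
batch_of = {g: 1 if g in "ABCDE" else 2 for g in groups}
n_reps = 3  # assumed number of biological replicates per group

sample_group = [g for g in groups for _ in range(n_reps)]
sample_batch = np.array([batch_of[g] for g in sample_group])

# One indicator column per group (no intercept): 10 independent columns.
group_design = np.array(
    [[1.0 if g == level else 0.0 for level in groups] for g in sample_group]
)
# Indicator column for membership in Batch 2.
batch_col = (sample_batch == 2).astype(float).reshape(-1, 1)
full_design = np.hstack([group_design, batch_col])

rank_group = np.linalg.matrix_rank(group_design)
rank_full = np.linalg.matrix_rank(full_design)
print(rank_group, full_design.shape[1], rank_full)  # 10 11 10

# The batch column is exactly the sum of the F-J group columns.
print(np.allclose(batch_col[:, 0], group_design[:, 5:].sum(axis=1)))  # True
```

The same reasoning is why any adjustment that removes the between-batch difference will also remove the average difference between groups A-E and groups F-J; only comparisons within a batch are untouched by it.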