Hi all,
To answer the biological question in my project, I need to integrate two to three publicly available datasets and perform pairwise DE analysis between various conditions. As much as we would like to generate a single dataset containing all the conditions (to minimize batch effects), we do not have the capacity to do so.
The two RNA-seq datasets that I would like to use were sequenced on the same platform with the same capture method. The datasets have the following sample sizes:
Dataset 1: Condition A (n = 100) and Condition B (n = 300)
Dataset 2: Condition C (n = 50) and Condition D (n = 20)
Dataset 3 (optional): Condition B (n = 30) and Condition C (n = 100)
In this regard, what is the "best" method to account for the batch effect while preserving biological differences in the differential expression results? Additionally, would incorporating Dataset 3 (which shares conditions with Datasets 1 and 2) and using RUVSeq improve the correction?
Thanks for the very descriptive reply. I will give it a try nonetheless, because this is pretty much the only way to address the biological question I have in mind. I've read papers that use RUVg for batch correction on the basis of negative control genes, so that it disregards all assumptions and uses these genes as anchors. But I agree with you that limma::removeBatchEffect or directly blocking for batch during the DE analysis are probably the most straightforward approaches, given that the batches are known and the effect is linear.
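
One thing worth checking before blocking for batch is whether the batch and condition effects can be separated at all. Below is a minimal sketch in plain Python (not limma code) that builds an additive condition + batch design from the (condition, dataset) cells listed above and tests whether it has full column rank. It assumes each dataset is one batch, that Dataset 3 is treated as its own batch, and that batch effects are purely additive. The helper names `matrix_rank`, `design_matrix` and `is_estimable` are made up for illustration.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank via Gauss-Jordan elimination with exact rational arithmetic."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current pivot position with a nonzero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def design_matrix(cells):
    """One row per (condition, batch) cell: condition indicators plus
    batch indicators, with the first batch as the reference level.
    Replicates within a cell are omitted because they do not change the rank."""
    conditions = sorted({c for c, _ in cells})
    batches = sorted({b for _, b in cells})[1:]
    return [[int(c == k) for k in conditions] + [int(b == k) for k in batches]
            for c, b in cells]


def is_estimable(cells):
    """True if every condition and batch coefficient can be estimated."""
    X = design_matrix(cells)
    return matrix_rank(X) == len(X[0])


# (condition, dataset) cells taken from the sample list in the question.
two_datasets = [("A", 1), ("B", 1), ("C", 2), ("D", 2)]
three_datasets = two_datasets + [("B", 3), ("C", 3)]

print(is_estimable(two_datasets))    # False
print(is_estimable(three_datasets))  # True
```

With only Datasets 1 and 2, every condition lives in a single batch, so the design is rank deficient and any comparison across datasets (A or B versus C or D) is confounded with batch. Adding Dataset 3 links the two through the shared conditions B and C, which makes the additive model estimable, although cross-dataset contrasts then still rest on the assumption that the batch effect is the same for every condition.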