This is a theoretical question/discussion. I have tried to search around, but the results were flooded with methods papers on batch correction. The question is more about the fundamentals of the technologies and how we do science. I frame it in terms of sequencing, but it might also apply to other omics.
Long story short:
I wonder whether sequencing batches may contain some biological variation. For example, when different patients have biopsies taken at different centers or time points, these samples (and the resulting data) are inherently different, because they come from different patients. I also wonder if there is any literature formally discussing cases in which we should embrace this variability when analyzing data.
I understand that we would like to regress out batch effects, and sometimes also the variability between patients within the same "group", in order to have good statistical power for detecting differences between groups. However, this correction seems to inevitably diminish sample variability. Is it possible to dissociate batch effects from inherent biological variability?
One possibility might be sequencing the same sample multiple times across different batches. However, this may not be feasible given the cost of having technical replicates for each sample, so I am looking for a way to dissect these effects computationally. I tried ComBat, but having one sample per batch does not make sense to me (or maybe I am wrong).
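When batch and the biological grouping are not confounded, one common option is to fit both in a linear model and subtract only the fitted batch component, so the biological effect is protected (this is conceptually what ComBat does when you pass it a model matrix of covariates to preserve). A minimal numpy sketch on made-up data for a single gene, assuming a balanced design:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 12 samples, 2 biological groups, 3 batches, balanced design
# (each group appears in every batch). All numbers are invented.
n = 12
group = np.repeat([0, 1], 6)        # biological condition
batch = np.tile([0, 1, 2], 4)       # sequencing batch
expr = 2.0 * group + 1.5 * batch + rng.normal(0, 0.3, n)  # one gene

# Design matrix: intercept, biological group, and batch dummy columns.
X = np.column_stack([
    np.ones(n),
    group,
    (batch == 1).astype(float),
    (batch == 2).astype(float),
])
beta, *_ = np.linalg.lstsq(X, expr, rcond=None)

# Subtract ONLY the fitted batch component; the group effect is untouched.
corrected = expr - X[:, 2:] @ beta[2:]
```

Because group is estimated jointly with batch, the correction removes the batch shift without shrinking the between-group difference, which is the part you want to keep. This only works when the design lets the model tell the two apart.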
Going one step further: all these technologies generate a "snapshot" of the underlying sample, yet in some instances we would expect genuine variation between batches. For example, cells are captured at different cell-cycle stages. I know there are recommendations to regress out the cell-cycle dependency when analyzing scRNA-seq data, and I believe this is (also) mainly aimed at increasing statistical power for differential testing. However, in the case of unsupervised learning, should we keep these "batch effects" unadjusted if there is no good way to dissociate them from real biological variation?
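For intuition on what "regressing out" actually removes, here is a toy sketch (numpy only; the per-cell score and gene loadings are invented): each gene is fit against a continuous nuisance score and only the residuals are kept, which is conceptually what per-cell covariate regression (e.g. a cell-cycle score) does in scRNA-seq pipelines. After regression the correlation with the score is zero by construction, which is exactly why any true biology that happens to track the score is removed along with it:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 100 cells x 5 genes; a continuous "cell-cycle score"
# (hypothetical) contributes to every gene's expression.
cells, genes = 100, 5
cc_score = rng.normal(0, 1, cells)
loadings = rng.normal(0, 1, genes)
data = np.outer(cc_score, loadings) + rng.normal(0, 0.5, (cells, genes))

# Regress the score out of each gene: keep residuals of gene ~ 1 + score.
X = np.column_stack([np.ones(cells), cc_score])
beta, *_ = np.linalg.lstsq(X, data, rcond=None)
residuals = data - X @ beta
```

The residuals are orthogonal to the score, so downstream clustering or PCA can no longer see it, for better or worse depending on whether the score was nuisance or biology.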
Any discussions, comments, and suggestions will be helpful!
I think the long and short of it is to treat batch as you would any other variable. If you were studying 3 conditions in 2 strains of mice, with N (= 3, say) being the recommended number of replicates, you would need 3 × 2 × N samples, each from a separate mouse. If you split these into sequencing batches badly, you will get into trouble no matter what, as you cannot account for the extra variable. However, if you had many samples per condition, you could account for the batch variable without losing statistical power.
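The trouble with a bad split can be seen directly from the design matrix: if batch perfectly mirrors condition, the matrix is rank-deficient and the two effects cannot be separated, whereas a balanced split keeps it full rank. A quick numpy illustration (toy group labels, not from any real study):

```python
import numpy as np

# Confounded design: every condition-A sample in batch 0, every
# condition-B sample in batch 1. The batch column duplicates the
# condition column, so the model cannot tell the effects apart.
cond_c  = np.array([0, 0, 0, 1, 1, 1])
batch_c = np.array([0, 0, 0, 1, 1, 1])
X_confounded = np.column_stack([np.ones(6), cond_c, batch_c])
# matrix_rank(X_confounded) is 2, not 3: batch effect unidentifiable.

# Balanced design: both conditions appear in both batches.
cond_b  = np.array([0, 0, 1, 1, 0, 0, 1, 1])
batch_b = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X_balanced = np.column_stack([np.ones(8), cond_b, batch_b])
# matrix_rank(X_balanced) is 3: both effects are estimable.
```

This is why randomizing or blocking samples across batches at the design stage matters more than any downstream correction.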
Do not try to manually distinguish between batch and biology if the experimental design is confounded; it is not feasible.