Question

Forum: Do batches contain biological variation in omics data (RNA-seq or scRNA-seq)?


17 months ago

cwwong13 ▴ 40

This is a theoretical question/discussion. I have tried to search around, but the results were flooded with method papers on batch correction. The question may relate more to the fundamentals of the technologies and how we do science. I frame it in terms of sequencing, but it may also apply to other omics.

Long story short:

I wonder whether sequencing batches may contain some biological variation. For example, when different patients had biopsies taken at different centers or time points, these samples (and the resulting data) are inherently different (because they come from different patients). I also wonder whether there is any literature formally discussing the cases in which we should embrace variability when analyzing data.

I understand that we would like to regress out batch effects and sometimes also the variability between patients within the same "group". The aim is to have good statistical power for detecting differences between groups. However, this correction seems to inevitably reduce sample-to-sample variability. Is it possible to separate batch effects from inherent biological variability?

One possibility might be sequencing the same sample multiple times across different batches. However, this might not be possible given the cost of having technical replicates for each sample. Therefore, I am looking for a way to separate these computationally. I tried ComBat, but having only one sample per batch does not make sense to me (or maybe I am wrong).

One step further: all these technologies generate a "snapshot" of the underlying sample. However, in some instances we would expect variation within batches. For example, cells are in different cell cycle stages. I know there are recommendations for regressing out the dependency on the cell cycle when analyzing scRNA-seq data. I believe this is (also) mainly aimed at enhancing the statistical power of differential tests. However, I am curious whether, in the case of unsupervised learning, we should leave these "batch effects" unadjusted if there is no good way to separate them from real biological variation.

Any discussions, comments, and suggestions will be helpful!

combat single-cell omics batch rnaseq • 1.2k views

updated 17 months ago by Ram 45k • written 17 months ago by cwwong13 ▴ 40


I think the long and short of it is to treat batch as you would any other variable. If you were studying 3 conditions in 2 strains of mice, with N (= 3, say) being the recommended number of replicates, you'd need 3 × 2 × N samples, each from a separate mouse. If you were to split this into sequencing batches, you'd get in trouble no matter what, as you cannot account for the extra variable however you split the cohort. However, if you had a lot of samples per condition, you could account for the batch variable without losing statistical power.

Do not try to manually distinguish between batch and biology if the experimental design is bad; it's not feasible.

17 months ago by Ram 45k

Ram · Answer 1 · 2023-12-30

In the end, distinctions between different types of variance are human distinctions that don't have any inherent meaning. Biological, technical, and batch variation are all subjective terms.

In general, if variation comes from a source we are interested in, we call it biological. If we are not interested in it, we generally call it technical variation (this is closely related to the question of what counts as a technical and what counts as a biological replicate; see more on that here).

Batch effects originally referred to variation introduced by the fact that samples had to be processed in batches. So you might find that samples processed on Tuesday had higher signal than samples processed on Wednesday. In general, that is never a source of variation we are interested in studying. With batch effects like this, correction usually involves including the batch as an explanatory variable in the model that is used to fit the data (usually a negative binomial generalized linear model in the case of RNA-seq).
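To make "including the batch as an explanatory variable" concrete, here is a minimal sketch. It swaps in ordinary least squares on simulated log-expression for the negative binomial model, purely to keep the example dependency-free; the design, effect sizes, and variable names are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical balanced design: 2 conditions x 2 batches, 5 samples per cell.
condition = np.tile([0, 0, 1, 1], 5)   # 0 = control, 1 = treated
batch = np.tile([0, 1, 0, 1], 5)       # 0 = Tuesday, 1 = Wednesday

# Simulated log-expression for one gene (assumed effect sizes).
true_condition_effect = 1.0
true_batch_effect = 2.0
log_expr = (5.0
            + true_condition_effect * condition
            + true_batch_effect * batch
            + rng.normal(0.0, 0.1, size=condition.size))

def fit(columns):
    """Least-squares fit of log_expr on an intercept plus the given columns."""
    design = np.column_stack([np.ones(condition.size)] + columns)
    coef, *_ = np.linalg.lstsq(design, log_expr, rcond=None)
    residual_sd = np.std(log_expr - design @ coef)
    return coef, residual_sd

coef_with_batch, sd_with_batch = fit([condition, batch])
coef_without_batch, sd_without_batch = fit([condition])

print("condition effect (batch in model):", coef_with_batch[1])
print("residual SD with batch in model:  ", sd_with_batch)
print("residual SD without batch:        ", sd_without_batch)
```

Because this toy design is balanced, the condition estimate is similar either way; what changes is the unexplained variance, which is what costs you power when batch is left out of the model.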

So using this definition, variation caused by batch isn't biological variation by definition. However, that doesn't mean that variables that are more biological cannot be confounded with batch, so that it is impossible to isolate the effects of batch from the effects of the "biological" variable of interest. Whether or not patient-to-patient variation is of interest (and therefore is "biological" or "technical") depends on the statistical question being addressed, but if you sequence all the samples from patient 1 on Monday, all the samples from patient 2 on Tuesday, and all the samples from patient 3 on Wednesday, you will never be able to remove day-to-day variation without removing patient-to-patient variation. Thinking about solutions to this leads to an entire field of statistics called "experimental design", and alternatives to simply repeating patients in different batches include randomized block designs, repeated measures, and so on.
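The Monday/Tuesday/Wednesday problem shows up directly in the design matrix. The sketch below (a made-up 9-sample layout, with hypothetical helper names) compares a fully confounded layout with one where every patient contributes a sample to every day. When the rank is lower than the number of columns, the patient and day effects cannot be estimated separately.

```python
import numpy as np

# Confounded: each patient is sequenced entirely on their own day.
patient = np.repeat([0, 1, 2], 3)
day_confounded = np.repeat([0, 1, 2], 3)
# Blocked: every patient contributes one sample to each day.
day_blocked = np.tile([0, 1, 2], 3)

def one_hot_drop_first(labels):
    """Indicator columns for every level except the first (reference) level."""
    levels = np.unique(labels)
    return np.column_stack([(labels == lv).astype(float) for lv in levels[1:]])

def design_matrix(patient_labels, day_labels):
    """Intercept + patient indicators + day indicators."""
    intercept = np.ones((patient_labels.size, 1))
    return np.hstack([intercept,
                      one_hot_drop_first(patient_labels),
                      one_hot_drop_first(day_labels)])

ranks = {}
for name, day in [("confounded", day_confounded), ("blocked", day_blocked)]:
    X = design_matrix(patient, day)
    ranks[name] = np.linalg.matrix_rank(X)
    print(name, "columns:", X.shape[1], "rank:", ranks[name])
```

In the confounded layout the day columns are copies of the patient columns, so the matrix has 5 columns but rank 3; in the blocked layout all 5 columns are independent.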

If each patient is only measured once (i.e. not a before and after, but say 200 patients had a single sample taken at one centre, and 200 at a different centre), then there is an inherent assumption in the design, which is that the patients are all drawn from the same random distribution. We are absolutely interested in the patient-to-patient variation, and don't want to remove this, because it's that variation that allows us to identify this underlying distribution. If you remove batch effects, you are not making each patient identical (removing patient variation), but you are ensuring that the mean of patients from one centre is the same as the mean of patients from the other. If that assumption doesn't hold, then the design is bad.
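A small simulation illustrates the distinction between equalizing centre means and removing patient variation. This is only a sketch with invented numbers (one gene, 200 patients per centre, a shift of 3 units at the second centre), and the "correction" is plain mean-centring rather than any particular package's method.

```python
import numpy as np

rng = np.random.default_rng(1)

n_per_centre = 200
centre = np.repeat([0, 1], n_per_centre)

# Assumed values: patients share one distribution, centre 1 adds a shift.
centre_shift = np.array([0.0, 3.0])
expr = 10.0 + centre_shift[centre] + rng.normal(0.0, 1.0, size=centre.size)

# Move each centre's mean onto the grand mean; nothing else is changed.
grand_mean = expr.mean()
corrected = expr.copy()
for c in (0, 1):
    mask = centre == c
    corrected[mask] += grand_mean - expr[mask].mean()

for c in (0, 1):
    mask = centre == c
    print(f"centre {c}: mean {expr[mask].mean():.2f} -> {corrected[mask].mean():.2f}, "
          f"SD {expr[mask].std():.2f} -> {corrected[mask].std():.2f}")
```

The centre means coincide after correction, while the spread of patients within each centre is untouched. If the two centres really recruited different kinds of patients, the same operation would erase a real difference, which is the "bad design" case.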

The above depends on knowledge of which sample falls into which batch, and sometimes you don't have that sort of information. Tools like SVA (surrogate variable analysis) were invented to deal with situations where batch membership is unknown (ComBat itself needs the batch labels). However, there is nothing special about "batch", and these tools allow the removal of any source of variation that is orthogonal to a nominated variable of interest. Thus, over time "batch effect" has come to mean any unwanted source of variation that affects a group of samples similarly. That might be things like what cell cycle stage a cell is at, or the genetic ancestry of a patient (which is generally unknown from the expression data).

One of the downsides of this is that when such a tool identifies an axis of variation, you don't really know what its source is, just that it is not correlated with your nominated variable of interest. This could well make sense if your statistical question is a comparison of means (you just want to know if your variable of interest affects mean expression of each gene), but may well make less sense for other statistical questions, where the variation might be of interest.
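The idea of an unlabeled axis that is uncorrelated with the variable of interest can be sketched as a PCA of residuals. To be clear, this is neither ComBat's algorithm nor the full surrogate variable procedure, only a stripped-down illustration on simulated data; the hidden factor, effect sizes, and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

n_samples, n_genes = 40, 100
condition = np.tile([0, 1], n_samples // 2)    # nominated variable of interest
hidden = np.repeat([0, 1], n_samples // 2)     # unrecorded factor (e.g. processing day)

# Simulated expression: hidden factor hits all genes, condition hits 10 genes.
hidden_loadings = rng.normal(0.0, 1.0, size=n_genes)
condition_loadings = np.zeros(n_genes)
condition_loadings[:10] = 1.0
expr = (np.outer(hidden, hidden_loadings)
        + np.outer(condition, condition_loadings)
        + rng.normal(0.0, 0.5, size=(n_samples, n_genes)))

# Step 1: regress every gene on intercept + condition and keep the residuals.
design = np.column_stack([np.ones(n_samples), condition])
coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
residuals = expr - design @ coef

# Step 2: the leading left singular vector of the residuals is the estimated axis.
u, s, vt = np.linalg.svd(residuals, full_matrices=False)
surrogate = u[:, 0]

corr_hidden = np.corrcoef(surrogate, hidden)[0, 1]
corr_condition = np.corrcoef(surrogate, condition)[0, 1]
print("|corr| with hidden factor:", abs(corr_hidden))
print("|corr| with condition:    ", abs(corr_condition))
```

The recovered axis is uncorrelated with the condition by construction, and here it happens to track the simulated hidden factor. With real data nothing in the output tells you whether that axis is processing day, cell cycle, ancestry, or something you would have wanted to keep.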

When you ask if it is worth not removing "batch" effects to do something like unsupervised learning, the answer is always "it depends". If what you mean by "unsupervised learning" is clustering, you might find that your batch effect entirely dominates the clustering, so that all the clustering does is to split things into "batches". Only you can say if that is useful or interesting to you, and it might depend on whether you can identify if a clustering is "batch" driven or not.
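One way to check whether an unsupervised analysis is "batch" driven is to look at how the leading principal component relates to known labels. The sketch below uses simulated data in which a batch shift affects every gene and a weaker cell-type signal affects only a few; all sizes and effect values are invented.

```python
import numpy as np

rng = np.random.default_rng(2)

n_per_group, n_genes = 25, 50
batch = np.repeat([0, 1], 2 * n_per_group)               # 50 samples per batch
cell_type = np.tile(np.repeat([0, 1], n_per_group), 2)   # 25 of each type per batch

# Assumed effects: batch shifts every gene, cell type shifts only 5 genes.
cell_type_loadings = np.zeros(n_genes)
cell_type_loadings[:5] = 1.0
expr = (rng.normal(0.0, 1.0, size=(batch.size, n_genes))
        + 2.0 * batch[:, None]
        + np.outer(cell_type, cell_type_loadings))

# PCA via SVD of the gene-centred matrix; PC1 scores are u[:, 0] * s[0].
centred = expr - expr.mean(axis=0)
u, s, vt = np.linalg.svd(centred, full_matrices=False)
pc1 = u[:, 0] * s[0]

corr_batch = abs(np.corrcoef(pc1, batch)[0, 1])
corr_cell_type = abs(np.corrcoef(pc1, cell_type)[0, 1])
print("PC1 |corr| with batch:    ", corr_batch)
print("PC1 |corr| with cell type:", corr_cell_type)
```

Any clustering run on these data would mostly recover the two batches. Whether that is a nuisance or a finding depends on what the "batch" label stands for in your study.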