Are we tricking ourselves with batch effect correction?
4
44
9.5 years ago
Christian ★ 3.1k

This is probably a stupid question, but I am new to this topic, so please forgive me.

I have an RNA-seq data set where preliminary analysis suggests a technical batch effect introduced at the library prep step, the sequencing step, or both. In a PCA plot constructed purely from global sample read count statistics (the ratio of exonic to intronic/intergenic reads, the ratio of rRNA reads to exonic reads, and the duplication rate -- none of which should reflect true biological variation), the samples that cluster together are typically the same ones that cluster together in the PCA plot I generated from normalized gene expression values. Very bad...

Now, I set out to correct for this effect using ComBat, which to my surprise worked really nicely. Samples now cluster perfectly by biological subgroup. But that looked almost too good to be true, so I ran a control experiment in which I randomly permuted my batch labels. And voilà, again I get perfect clustering by biological subgroup!
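
For concreteness, a minimal sketch of this kind of permutation control (assuming a normalized log-expression matrix expr with genes in rows and samples in columns, plus factors batch and group; all object names are placeholders, not a real script):

```r
library(sva)

mod <- model.matrix(~ group)   # biological variable of interest

## "Real" correction with the suspected batch labels
combat_real <- ComBat(dat = expr, batch = batch, mod = mod)

## Negative control: randomly permute the batch labels and correct again
set.seed(1)
batch_perm  <- sample(batch)
combat_perm <- ComBat(dat = expr, batch = batch_perm, mod = mod)

## PCA on both corrected matrices; if samples cluster by group even after
## "correcting" for the permuted (meaningless) batches, be suspicious
pca_real <- prcomp(t(combat_real))
pca_perm <- prcomp(t(combat_perm))
plot(pca_perm$x[, 1:2], col = as.integer(group), pch = 19)
```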

This brings me to my question: since both SVA and ComBat take the variable of interest (e.g. the biological subgroup) as a model parameter, aren't these methods strongly biasing the expression values towards this endpoint? I understand that this is kind of the point of batch effect correction, but how can we be sure that these methods actually eliminate true/meaningful batch effects and do not just overfit the data? I can't say that I fully understand these methods, but my hunch is that in high-dimensional data sets like those from RNA-seq you can always find random covariates to factor away such that the data fits the desired outcome. And then we say: eureka, it works, now the data makes perfect sense!

How can we be sure we are not tricking ourselves with these methods?

combat batch-effect sva • 26k views
10

Yesterday I heard that some of my former co-authors had to retract a Nature paper due to invalid statistics for the null model. OK, well, that is a tough pill to swallow, but it can happen to anyone; it is not easy to get that right. In fact, some years back, when I was less experienced, not being sure whether I had done an analysis right used to keep me up at night ... I am still not sure if I do it right, but I sleep better ;-)

But what was really eye-opening to me is that the paper itself lays out extremely convincing plots, visualizations and effect sizes showing strong and radical effects based on the data. Only that the data in turn was no better than random noise! How could that possibly happen? The answer is that the dimensionality of the space is huge; there is always something that seems to fall into place. I know the authors, both are well-intentioned and honest scientists. But then it also shows better than anything just how easily one can get fooled when dealing with methods one does not fully understand. You can get very convincing plots out of what is essentially random noise - I think this paper should be taught in school!

In a nutshell, my personal opinion is that batch effect correction is a valid approach when done right - but I am also convinced that it does far more damage than good, as God knows how many scientists who do not understand it apply it to their data and discover wondrous new phenomena that are not actually there.

0

Yes, proper negative controls are key (also in computational research!). In my case the negative control using permuted batches was kind of eye-opening and triggered this question. I would still be very much interested in an answer, but maybe this is really a fundamental limitation of batch correction methods we have to live with.

0

But the data from this paper did not have a batch effect, right?

They did not use proper background control, which led to wrong conclusions.

However, if somebody could volunteer advice on how to look for potential batch effects in samples, that would be highly appreciated and very useful.

Thanks!

0

Yes, correct. The point I wanted to make is more about using a tool one does not fully understand, such as a pattern matcher or, say, a batch effect normalizer, which can nonetheless produce a very strong signal.

0

While I have my own thoughts on this, I just tweeted this post to Jeff Leek. Hopefully he can write a reply, since he has obviously thought quite a bit about SVA/ComBat.

26
8.5 years ago

Hi Christian

I found your post a little late, but I found it so interesting that I will give it a late answer. In your description of your experiment you do not describe the experimental design, i.e. whether the batches contained the same proportion of samples from each of your biological groups. Based on your observations, I am pretty sure it was quite unbalanced.

A couple of years ago I went through the same experience as you. I had a severely unbalanced data set with batch effects, which I tried to salvage using ComBat in what seemed like the recommended way. I thought my results were too good to be true, so, like you, I did several sanity checks with permuted labels and random numbers. The results were equally good clusterings of the "groups" and long lists of differentially expressed genes. After some more reading and head-scratching I could not figure out what I had done wrong, so I asked for help on the Bioconductor list: Is ComBat introducing a signal where there is none?

However, I was unable to attract any responders. Anyway, batch effects were such a common problem at my work that we just had to figure this out. With the help of experienced and more mathematically skilled colleagues, we looked further into this and wrote a paper: "Methods that remove batch effects while retaining group differences may lead to exaggerated confidence in downstream analyses"

Our advice is:

For an investigator facing an unbalanced data set with batch effects, our primary advice would be to account for batch in the statistical analysis. If this is not possible, batch adjustment using outcome as a covariate should only be performed with great caution, and the batch-adjusted data should not be trusted to be "batch effect free", even when a diagnostic tool might claim so.

So instead of creating a batch-effect-free data set, one should strive to keep the data as is, with all its effects, and include the batch (and other) effects in the analysis, for instance by blocking for batch in limma.
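
For illustration, here is a minimal limma sketch of "blocking for batch", assuming a log-expression matrix expr and factors batch and group with two groups (placeholder names):

```r
library(limma)

## Batch stays in the data; it is simply part of the model
design <- model.matrix(~ batch + group)

fit <- lmFit(expr, design)
fit <- eBayes(fit)

## With a two-level group factor, the group effect is the last coefficient
topTable(fit, coef = ncol(design))
```

The same idea applies to count-based tools, e.g. a design of ~ batch + group in edgeR or DESeq2.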

I will try to answer your questions

Are we tricking ourselves with batch effect correction?

Yes, if we treat the newly created data set as "batch effect free", i.e. assume this is what the data would have been had we not had batch effects in the first place. The batch correction tools that create a new data set first need to estimate the batch effect and then remove that estimate from the data. However, this estimate has an estimation error, so what happens in practice is that the original batch effect is replaced by the estimation error. You will still have batch effects (the estimation errors) in your data.
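
To put the same point in a simplified one-gene formula (ad hoc notation, just for illustration, not taken from the paper): if a measured value in batch b is y = mu + beta_group + gamma_b + noise, the correction subtracts an estimate gamma_b_hat, so the "clean" value is y* = mu + beta_group + (gamma_b - gamma_b_hat) + noise. The leftover term (gamma_b - gamma_b_hat) is the estimation error, and it still depends on batch; in an unbalanced design, part of it gets absorbed into the apparent group effect.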

How can we be sure that these methods indeed eliminate true/meaningful batches and not just overfit the data?

In the project I was working on, we wanted to analyse the data with a tool that could not itself correct for batch effects (as, for instance, limma can), so I had to use ComBat anyway. However, I did not include the biological groups as covariates, and when doing PCA on the batch-adjusted data set my groups seemed to cluster. This convinced me that the variation in my data was now due more to biology than to the batch effect. However, my data would still contain batch effects, and the results later on had to be treated with caution.
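
In code, that more conservative use looks roughly like this (sketch only; expr, batch and group are placeholder names). ComBat is told about the batches but not about the biological groups, and the PCA is used afterwards only as a sanity check:

```r
library(sva)

## Adjust for batch without telling ComBat about the biological groups
combat_nomod <- ComBat(dat = expr, batch = batch, mod = NULL)

## Sanity check: do samples now group by biology rather than by batch?
pca <- prcomp(t(combat_nomod))
plot(pca$x[, 1:2], col = as.integer(group), pch = as.integer(batch))
```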

By doing before-and-after plotting there are ample possibilities to trick yourself into thinking that the batch effect is gone and you are left with a group effect. This can happen with clustering/PCA plots, but the more dedicated tool PVCA has also been used. PVCA estimates the effects in a data set. If you feed PVCA the same information (batch, group) as you gave the batch removal step, then no batch effect will be estimated; if there were one, it would also have been estimated by the removal tool and removed from the "clean" data. The remaining batch effect (the estimation error) will not be detected as a batch effect, but some of it will be identified as a group effect if your data set is very unbalanced. I found this by doing a lot of simulations with known group/batch effects.

How can we be sure we are not tricking ourselves with these methods?

By knowing their limitations, i.e. that the "clean" data will contain batch effects as well. The more ideal solution would be not to use them at all: preferably by not having batch effects in the first place, by having a balanced design, or by accounting for batch in the statistical analysis to be performed.

2

We spent considerable time on simulations of the type also described by Neilfws. As aids in doing so, we made a couple of Shiny tools, "batch-adjustment-simulator" and the somewhat sparsely documented "fdist-app", with source code available via our additional-analyses page on GitHub: https://github.com/ous-uio-bioinfo-core/batch-adjust-warning-reports. Please feel free to try them out or download them to see or reuse the code, but be aware that they are not intensively tested and may contain bugs.

My answer was too long (more than 5000 characters), so I had to comment on my own answer to get the last part in.

Was my answer really that ridiculously long?

Vegard.

1

My main concern with batch effect correction tools like ComBat is not so much the residual batch effect after correction, but that they are way too aggressive by pushing the data towards the desired outcome, whatever the input might be. You have obviously observed exactly the same issue, and I also think that, for that reason, using the outcome as a covariate in batch effect correction is very dangerous!

And yes, in my little experiment I was using an unbalanced data set, but I have not yet fully understood why this design is problematic and will have to read more about your work.

Anyway, very interesting work you have done on this topic, thanks for sharing!

1

I could not agree with you more when you say that they are "way too aggressive by pushing the data towards the desired outcome". I used ComBat and other programs to correct for batch effects but abandoned them all because I could not trust how they were transforming the data.

6
9.5 years ago
Marge ▴ 320

Hi Christian,

Thank you for sharing this.

I have been thinking a lot about batch correction problems in the particular scenario in which you don't know the batches a priori but are pretty convinced that they are there. Let's say you are analysing a medium-sized dataset (maybe 150 samples) collected over several years, with data produced by many different people in different labs using a protocol that spans 100 steps spread over several days and is therefore itself subject to technical variation. You have some of the information about common confounders (e.g. where samples were prepared, the operator, flow cell, machine, ..., ...) but not all of it and not for all samples.

Now imagine that the hypothesis behind the experiment is the presence of differences between a number of conditions, but when you do a first visual exploration of the data none of the hypothetical conditions is "visible". Why is this? Either the hypothesis is wrong (or maybe not as clear-cut as the planners of the experiment expected), or there are one or more confounders (likely several, in the scenario I propose) making things fuzzy.

Now you are left with only two options: you either use the data as is, or you try to apply an approach that can in some way extrapolate possible batches based on the assumption that the biological groups are characterized by differences in expression. Of course, if you go for the latter, you can only use your hypothesis for the model, and you are unavoidably pushing the data to fit it.

I think the only way to have a confirmation that you are not overfitting is replication (besides, of course, all the theory, proofs and explanations that come with the method). I would imagine that if the adjustment is robust and the biological separation into groups is sound, then you should be able to find, e.g., the same differentially expressed genes if you do some sort of 2-fold cross-validation (but of course you need a large dataset to really try this).

Does this make sense?

0

I think the only way to have a confirmation that you are not overfitting is replication

That sounds promising. Could you elaborate on this? I suppose you mean that you take the first half of the data, learn a model for batch correction from it, use this model to correct batches in the second half, and examine the results? If on this second half you also see an improvement (i.e. more of the thing you want to see), we assume batch correction did something meaningful. Correct?

0

Mmmh, I didn't quite think about it in these terms, and I don't think it would work. In my very naive view, the two subsets could require different correction schemes because of their own distributions of confounders, so, intuitively speaking, I can't imagine replication working like this.

I was instead thinking in more blunt, downstream-oriented terms. Imagine you have a case/control dataset for a given disease and you are interested, for example, in differences in expression between the two groups. If you have enough samples, you could split them into two subsets with a similar number of cases and controls, then apply your batch-finding-and-adjustment approach in each of the two subsets and go all the way down to the results of a case/control differential expression analysis. Now, if the differences in expression between cases and controls are mostly related to the general mechanisms of the disease, the two analyses should give you highly overlapping output.
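
A rough sketch of that split-half check (placeholder names throughout: expr, a case/control factor status, and run_pipeline(), which stands for whatever batch-finding/adjustment plus differential expression analysis you would apply; it is not an existing function):

```r
set.seed(1)
n <- ncol(expr)

## Split the samples into two halves, keeping a similar number of cases and
## controls in each half (simple stratified split on the case/control factor)
half1 <- unlist(lapply(split(seq_len(n), status),
                       function(i) sample(i, floor(length(i) / 2))))
half2 <- setdiff(seq_len(n), half1)

## Run the full batch-adjustment + differential expression pipeline separately
de1 <- run_pipeline(expr[, half1], status[half1])
de2 <- run_pipeline(expr[, half2], status[half2])

## If the adjustment is not overfitting, the two gene lists should overlap
## far more than expected by chance
length(intersect(rownames(de1), rownames(de2)))
```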

Of course, you can then start wondering whether the differences you observe are a cause or a consequence of the disease, but this is yet another story and has little to do with confounder correction: at least at this stage you would be reasonably sure that you have identified a set of genes that is robustly dysregulated in the given disease and that this is not due to any kind of confounder (even though, like Istvan, I also believe that there may be completely unexpected structures in the data that amazingly fit what we are looking for... but I also sleep better these days ;-)). Ultimately, if there is a large overlap between the two gene lists, then you may also be convinced that you did not overfit with your correction scheme.

Even more bluntly, I have to say that I was initially thinking about replicating your study with another study (which, depending on what you are studying, is sometimes easy and sometimes simply impossible). In principle, if the effect is "real" and "general" and both the adjustment and the analysis are robust, then applying the same approach to a different dataset with the same type of samples should give, if not the same result, at least something largely overlapping. I have come to understand over the years that biology doesn't work exactly like this, but I still think this should be the essence of replication (we are looking for general mechanisms after all: if our findings only apply to a single dataset, then we are a bit doomed...).

5
9.5 years ago
Neilfws 49k

To address your question "How can we be sure we are not tricking ourselves with these methods?", one short answer is "simulated data".

It's possible to create artificial datasets with desired properties: e.g. with/without batch effects, with/without biological effects, then see whether they behave as expected when processed using the method you're developing. This is typically how statisticians test the performance of their algorithms.
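
As a small illustration of this kind of check, the sketch below (assumptions: pure-noise data with no real group or batch effect, an unbalanced batch/group layout, placeholder names) generates random data and then applies ComBat with the group as a covariate; following the discussion above, any group separation that appears afterwards is a warning sign rather than a discovery:

```r
library(sva)
set.seed(42)

n_genes   <- 2000
n_samples <- 20

## Pure noise: no group effect and no batch effect in the simulated data
expr_sim  <- matrix(rnorm(n_genes * n_samples), nrow = n_genes)

## Unbalanced layout: group A dominates batch 1, group B dominates batch 2
batch_sim <- factor(rep(c(1, 2), times = c(12, 8)))
group_sim <- factor(c(rep("A", 9), rep("B", 3), rep("A", 3), rep("B", 5)))

adjusted <- ComBat(dat = expr_sim, batch = batch_sim,
                   mod = model.matrix(~ group_sim))

pca <- prcomp(t(adjusted))
plot(pca$x[, 1:2], col = as.integer(group_sim), pch = 19)
```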

The recent article "A multi-view genomic data simulator" may be of interest in this regard.

1

The problem is that good data is well defined but bad data is not. Time and again I am surprised by just how weird and unexpected the systematic errors present in data can be. Hence it may not be possible to simulate the type of problems one actually has.

0

True, simulated data can prove that a method works for known batches, at least in principle. But what about false positives? In practice we often don't know the true batch (experimenter, reagents, library, lane, you name it), so we make assumptions based on clustering results and surrogate markers (like date) and run batch effect correction anyway, just to be on the safe side.

What I take from my little control experiment with randomized batches is that you have to be extremely careful with this attitude, because the data will always get massaged towards the desired endpoint, regardless of whether your batch is well defined/justified or not. From the results (improved clustering) we convince ourselves that data quality has improved, when in reality we might just as well have distorted the expression levels in some samples to the point where they no longer reflect the truth.

I guess what I have learned so far is that batch effect correction is powerful but dangerous, and should only be applied when one is convinced of the existence of a particular batch effect.

Thoughts?

0
22 months ago

For what it's worth, I recently developed a method that addresses this problem, called ConDo. It learns a transform that is applied identically for all values of the biological variable(s) of interest. This fixes the statistical problem you've discussed above, where one produces fake biological signal. Here's the paper: https://arxiv.org/abs/2203.12720, and here's the code: https://github.com/calvinmccarter/condo-adapter.

Incidentally, this approach also allows you to perform batch correction on test data, where you haven't observed the biological variable. This is especially useful in settings where you're trying to predict the biological variable of interest, and you need to perform batch correction before making this prediction.

