Hello,
I have found myself scratching my head after applying Seurat's current (v4) FindIntegrationAnchors and IntegrateData pipeline (which relies on CCA) to my scRNA-seq dataset, and I am wondering whether I am eliminating biologically relevant signal along with the batch effect.
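For reference, this is roughly the workflow I am running (a minimal sketch; the object name and the "sample" metadata column are placeholders for my actual data):

```r
library(Seurat)

# Split the merged object by sample / collection day ("sample" is a placeholder column name)
obj_list <- SplitObject(seurat_obj, split.by = "sample")

# Per-sample normalization and variable feature selection
obj_list <- lapply(obj_list, function(x) {
  x <- NormalizeData(x)
  FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
})

# CCA-based anchoring (the default reduction in Seurat v4) and integration
features <- SelectIntegrationFeatures(object.list = obj_list)
anchors  <- FindIntegrationAnchors(object.list = obj_list, anchor.features = features)
combined <- IntegrateData(anchorset = anchors)

# Downstream embedding on the batch-corrected assay
DefaultAssay(combined) <- "integrated"
combined <- ScaleData(combined)
combined <- RunPCA(combined)
combined <- RunUMAP(combined, dims = 1:30)
```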
In my data, there are 16 phenotypically distinct samples which were prepared in the wet lab on different days (a timepoint progression of ~14 days between the initial and the last timepoint). There is one initial seeded batch of cells which was split into three parts and subjected to:
- no treatment
- treatment A
- treatment B
For all three branches we collected cells at several progression timepoints (day 3, day 6, day 9, day 12 and day 14), giving us in total:
- 5 samples for the non-treated condition
- 5 samples subjected to treatment A
- 5 samples subjected to treatment B
plus the initial batch at day 0.
The samples after collection were fixed on the respective days.
The fixed samples were then encapsulated, had libraries prepared, and were sequenced in one go, in a single batch, hence there is no sequencing batch effect.
But there is most certainly a batch effect associated with days of collection of samples.
When I apply Seurat's integration and correction, the samples form a single blob of cells on all embeddings (left column for corrected embeddings and the right column for the uncorrected):
Two questions:
- Is our experiment utterly borked, or is Seurat too zealous in its batch correction?
- Should I even apply the batch correction given the fact that these aren't the same cells that come from different batches, but (supposedly) phenotypically different cells?
I normally would be a proponent of ALWAYS applying batch effect correction in the single-cell RNAseq setting, just in case. But here - am I doing it right?
Great analysis and greatly explained, thank you!!
It was always my understanding that whenever we have samples that were not prepared under the exact same conditions (for example, prepared on different days), then we should try to account for that in the data. Am I bamboozling myself on that?
As pointed out in this thoughtful and complete answer, the key question is the ratio of "biological effect" to "technical effect". Something that's not quite clear from your description is: how heterogeneous are your "cells"? One of the reasons that integration works is that a diverse population of cells constitutes a whole tissue, providing a significant amount of "biological variation" to exploit when mapping axes of variation onto one another.
If, as I suspect is the case here, the initial population of cells is homogeneous (i.e., a single cell line), then there is limited "ground truth biological variation" to use for anchoring. As such, I agree with the original answer's assertion: I would not apply a batch-effect-correcting integration, but in my analyses I would certainly include the batch as a covariate.
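As a rough sketch of what I mean in Seurat terms (assuming the merged, un-integrated object is called merged, the collection day sits in a hypothetical "collection_day" metadata column, and a "condition" column holds labels like "treatment_A" / "untreated"):

```r
library(Seurat)

# Work on the merged, un-integrated object; "collection_day" is an assumed
# metadata column holding the day each sample was collected.
merged <- NormalizeData(merged)
merged <- FindVariableFeatures(merged)

# Option 1: regress the batch out during scaling (affects PCA/UMAP, not the normalized counts)
merged <- ScaleData(merged, vars.to.regress = "collection_day")

# Option 2: keep the batch as a latent variable in differential expression tests
# ("condition", "treatment_A" and "untreated" are hypothetical labels)
de <- FindMarkers(merged,
                  ident.1 = "treatment_A", ident.2 = "untreated",
                  group.by = "condition",
                  test.use = "LR", latent.vars = "collection_day")
```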
Good question. Yes, the cells come from a single cell line initially. During the process of pluripotency induction (which is the case here), I would assume that they go through some common (across the timepoints) de-differentiation states, generating common cell states but in varying proportions.
Regarding your question about SCTransform, I have to say that I am not that familiar with its performance in practice. I know the method and what it aims to address, i.e., better stabilizing the variance of genes, particularly lowly expressed genes, but I have not used it enough to have a personal opinion about its performance, although I know people who have been using it and are happy with it. Recently, Ahlmann-Eltze & Huber (2023) showed that log-normalization, as implemented in Seurat's NormalizeData() function (with defaults) as well as in other tools, e.g., scanpy, works as well as or better than other transformations, e.g., SCTransform. Usually I stick with the log-normalization transformation, as it is easier for me to understand and compute. I guess the choice might also depend on your use case.
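Just to make the two alternatives concrete (a minimal sketch on some object obj):

```r
library(Seurat)

# Log-normalization with Seurat defaults: counts are scaled to 10,000 per cell
# and log1p-transformed; stored in the "data" slot of the RNA assay.
obj <- NormalizeData(obj, normalization.method = "LogNormalize", scale.factor = 1e4)

# The SCTransform alternative: regularized negative binomial regression, which
# replaces NormalizeData/FindVariableFeatures/ScaleData and creates an "SCT" assay.
obj_sct <- SCTransform(obj, verbose = FALSE)
```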
Best,
António
Nice pointer to the paper!!
From personal experience I would agree. I also dabbled in using SCTransform when it was all new and all the rage, especially since it was portrayed as the cutting-edge, once-and-for-all solution, but then there were papers (which I couldn't point to right now) showing that it is not much better than simple log-normalisation. Log-counts are also much more transparent, so... I also stick to them now.

And by the way, what about the SCTransform integration pipeline from Seurat, do we know how well it conserves biological signal compared to CCA and rPCA?
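To be clear about which pipeline I mean, here is a rough sketch of the SCT-based integration (obj_list is a placeholder list of per-sample objects; the reduction argument is where CCA vs rPCA comes in):

```r
library(Seurat)

# SCT-normalize each sample separately
obj_list <- lapply(obj_list, SCTransform)

# Select features and prepare the SCT assays for integration
features <- SelectIntegrationFeatures(object.list = obj_list, nfeatures = 3000)
obj_list <- PrepSCTIntegration(object.list = obj_list, anchor.features = features)

# For reduction = "rpca", each object needs its own PCA first
obj_list <- lapply(obj_list, RunPCA, features = features)

# Anchors on SCT-normalized data; reduction = "cca" (default) or "rpca"
anchors <- FindIntegrationAnchors(object.list = obj_list,
                                  normalization.method = "SCT",
                                  anchor.features = features,
                                  reduction = "rpca")
combined <- IntegrateData(anchorset = anchors, normalization.method = "SCT")
```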