Hello all, I have an snRNA-seq dataset of 20 samples (ALS and controls), which I integrated with Seurat. In my committee meeting, I was told to check whether I'm overcorrecting my data by adding an unrelated sample and seeing whether it mixes with the other samples. I integrated my samples together with one extra sample, PBMC. Below is the UMAP plot of my dataset + PBMC:
I conclude that the PBMC sample, which is not supposed to mix with the other samples, is mixed, so Seurat integration is overcorrecting my dataset. I read that Seurat's CCA integration tends to do so and that using reciprocal PCA (RPCA) would mitigate this overcorrection. So I integrated my dataset using Seurat RPCA. Below is the same UMAP plot for the same data (ALS and control + PBMC) after RPCA integration.
In this integration there seem to be fewer cells than in the previous one; however, the number is equal. Based on this, I concluded that reciprocal PCA is the better approach. However, when I did the clustering, there was not much difference in how much the PBMC mixes with the other samples. The only difference is that in the first UMAP the PBMC cells are spread widely across clusters, as the number of PBMC cells in each cluster shows, whereas in the second one they are mixed into fewer clusters. In neither of them do I see the PBMC clustering separately from the rest of the dataset.

My questions are as follows. First, does it make sense to conclude that the second one is less overcorrected than the first? Second, is this a proper way to evaluate overcorrection? Third, is there a more principled approach to evaluating overcorrection?

FYI, I did try Harmony integration as well but didn't end up using it, as I did not have enough stable clusters. There is a more principled approach to determining whether the data are overcorrected here: http://bioconductor.org/books/3.15/OSCA.multisample/correction-diagnostics.html#preserving-biological-heterogeneity However, it is reliable only if the samples include the same cell types, so in my case it does not work.

I really appreciate any comments on my questions.

Thanks,
Paria
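The per-cluster comparison described in the question (how many PBMC cells land in each cluster) can be turned into numbers that are directly comparable between the CCA and RPCA runs. The following is only a minimal sketch in plain Python, not part of Seurat: it assumes the cluster labels and a per-cell PBMC flag have been exported from the Seurat object, and the function names and the 0.9 threshold are hypothetical choices made for illustration.

```python
from collections import Counter


def pbmc_fraction_per_cluster(clusters, is_pbmc):
    """Return {cluster: fraction of its cells that are PBMC}."""
    clusters = list(clusters)
    is_pbmc = list(is_pbmc)
    total = Counter(clusters)
    pbmc = Counter(c for c, p in zip(clusters, is_pbmc) if p)
    return {c: pbmc.get(c, 0) / n for c, n in total.items()}


def pbmc_share_in_pbmc_dominated_clusters(clusters, is_pbmc, min_fraction=0.9):
    """Fraction of all PBMC cells that sit in clusters made up
    almost entirely of PBMC (cluster PBMC fraction >= min_fraction).

    min_fraction is an arbitrary illustrative threshold, not a standard.
    """
    clusters = list(clusters)
    is_pbmc = list(is_pbmc)
    n_pbmc = sum(1 for p in is_pbmc if p)
    if n_pbmc == 0:
        raise ValueError("no PBMC cells in input")
    frac = pbmc_fraction_per_cluster(clusters, is_pbmc)
    kept = sum(
        1 for c, p in zip(clusters, is_pbmc) if p and frac[c] >= min_fraction
    )
    return kept / n_pbmc


# Toy example: cluster "c" is pure PBMC, cluster "b" is half PBMC.
clusters = ["a"] * 4 + ["b"] * 4 + ["c"] * 2
is_pbmc = [False] * 4 + [True, True, False, False] + [True, True]
fractions = pbmc_fraction_per_cluster(clusters, is_pbmc)
share = pbmc_share_in_pbmc_dominated_clusters(clusters, is_pbmc)
```

A share close to 1 would mean the PBMC cells stay in their own clusters; a low share would mean they are absorbed into clusters dominated by the brain samples. Computing the same number for both integrations gives a like-for-like comparison, although the result still depends on the clustering resolution.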
I think it makes no sense to combine these data. They contain completely different cell types, on top of the batch effect that comes from being different studies. Why do you want to do that anyway? Assuming the PBMCs and your data had been generated together in the same experiment, which analysis would you do on that?
Thanks for your response. I don't need to study the integration of PBMC + my dataset. It was just to see whether my data is overcorrected. Because PBMC is a very different dataset, I expected to see a separate island in my UMAP plot. However, it is not clustering separately.
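The "separate island" expectation can also be checked without relying on the look of the UMAP, which can distort distances. One option is to ask, for each PBMC cell, what fraction of its nearest neighbours in the integrated low-dimensional space are also PBMC. The sketch below is an illustration under assumptions, not an established Seurat function: it assumes the integrated embedding (for example the integrated PCA coordinates) has been exported as a cells-by-dimensions array, the name `knn_purity` is hypothetical, and the brute-force distance computation is only practical on a subsample of cells.

```python
import numpy as np


def knn_purity(embedding, is_query, k=10):
    """Mean fraction of each query cell's k nearest neighbours
    (Euclidean, self excluded) that are also query cells."""
    X = np.asarray(embedding, dtype=float)
    q = np.asarray(is_query, dtype=bool)
    if X.ndim != 2 or X.shape[0] != q.shape[0]:
        raise ValueError("embedding must be (n_cells, n_dims), one flag per cell")
    if not q.any():
        raise ValueError("no query cells")
    if k < 1 or k >= X.shape[0]:
        raise ValueError("k must be between 1 and n_cells - 1")

    q_idx = np.flatnonzero(q)
    # Brute-force squared distances from every query cell to every cell.
    diff = X[q_idx][:, None, :] - X[None, :, :]
    d2 = (diff ** 2).sum(axis=2)
    # A cell must not count as its own neighbour.
    d2[np.arange(q_idx.size), q_idx] = np.inf
    # Indices of the k smallest distances per query cell (unordered).
    nn = np.argpartition(d2, k - 1, axis=1)[:, :k]
    return float(q[nn].mean())


# Toy check 1: two well-separated groups, query cells keep to themselves.
far_apart = np.vstack([np.zeros((20, 2)), np.full((20, 2), 100.0)])
far_apart[:, 0] += np.tile(np.arange(20), 2) * 0.01  # avoid identical points
flags = np.array([False] * 20 + [True] * 20)
purity_separated = knn_purity(far_apart, flags, k=5)

# Toy check 2: query cells interleaved with the others along a line.
line = np.arange(10, dtype=float).reshape(-1, 1)
alternating = np.arange(10) % 2 == 1
purity_mixed = knn_purity(line, alternating, k=2)
```

A purity near 1 indicates that the PBMC cells still form their own neighbourhood, while a value near the overall PBMC proportion in the dataset indicates that they have been mixed in with the other samples. The choice of `k` is arbitrary here and worth varying.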