scRNA-seq novice here: We have four 10X scRNA-seq samples, wildtype (WT) and knockout (KO), with n=2 each. Each replicate pair (one WT and one KO) was processed on the same day, with the same FACS sorting machine, lab, technician etc., to avoid batch effects as much as we could.
For comparative analysis between the conditions I went through the scran / OSCA workflow and now aim to integrate the datasets. Essentially the choice is either to merge the datasets without explicit batch correction (doing only per-sample depth correction via multiBatchNorm to equalize depth across the already normalized samples) or to apply fastMNN. I tested and visualized both approaches for each replicate independently, see below, and the results are quite different.
Both replicates (if no fastMNN is applied) show a reproducible separation by condition (which we expect), so probably the influence of condition is greater than any batch effect. When applying fastMNN the two conditions lose this separation.
Therefore my question: Are there situations where batch correction masks interesting biological features? Given that we see reproducible separation by condition, could it be more meaningful to not apply fastMNN? If I combine the datasets and only correct for batch = day (so rep1 is one batch and rep2 is the other), I manage to preserve the separation by condition. The tSNEs then pretty much look like the left panel in the plot below.
Comments and your experiences with this are appreciated.
Absolutely.
Can you explain the fastMNN application in your case a bit more? Did you run it on all four samples or separately on the pairs? What does the UMAP for all four samples look like without any batch correction?
To me it appears that the second approach is probably the most meaningful one as it removes the modest batch effect induced by the different library prep. days while leaving the differences in condition untouched (which is what I am interested in).
I agree. This is akin to integrating just the WT and just the KO, i.e. correcting for the technical influence of the day. Generally it seems like you did a pretty good job of keeping the batch effect low, given how closely the cells of the individual conditions track each other even without batch correction.
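For readers who want to try the day-as-batch approach discussed above, here is a minimal sketch. It is not the original poster's code: the four samples are replaced by mock data from scuttle::mockSCE(), and the column names sample, condition and day are assumptions made for illustration. Only the preparation day is passed to fastMNN as the batch, so the WT/KO difference within each day is not a target of the correction.

```r
library(SingleCellExperiment)
library(scuttle)
library(batchelor)

set.seed(1)

# Mock data standing in for the four real samples (names are hypothetical).
samples <- c("WT1", "KO1", "WT2", "KO2")
sce.list <- lapply(samples, function(s) {
    x <- mockSCE(ncells = 100, ngenes = 500)
    colnames(x) <- paste0(s, "_", colnames(x))
    x$sample <- s
    x$condition <- substr(s, 1, 2)          # "WT" or "KO"
    x$day <- paste0("day", substr(s, 3, 3)) # replicate 1 = day1, replicate 2 = day2
    logNormCounts(x)                        # per-sample normalization
})
sce <- do.call(cbind, sce.list)

# Rescale size factors so that sequencing depth is comparable across samples.
sce <- multiBatchNorm(sce, batch = sce$sample)

# Correct only for the preparation day; condition is not treated as a batch.
mnn.day <- fastMNN(sce, batch = sce$day)
mnn.day$condition <- sce$condition

# Corrected low-dimensional coordinates for tSNE/UMAP and clustering.
dim(reducedDim(mnn.day, "corrected"))
```

With real data, one would colour the resulting embedding by condition and by day to check that the days overlap while the conditions stay apart.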
(I edited my response after I read your responses more carefully).
So it is evident from your merged-data analysis that batch correction is needed for integration. However, I feel that if you perform batch correction by day of prep, you are merging the two conditions into one object, and the replicates are then corrected with rep1 as the reference. To me that seems somewhat biased. I would have preferred batch correction by individual experiment.
Could you just analyze the two experiments separately, perform clustering and get markers? Then check, when you perform batch correction (with all individual samples as separate batches), whether clusters with similar markers overlap.
I hope you have already taken care of this, but the order of the sce objects matters for fastMNN when you supply the list of sce objects, so they should be: WT1, WT2, KO1 and KO2 (WT and KO are interchangeable, obviously). If you do these things, I hope you post your analysis; I am interested in seeing the results.
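A sketch of the per-sample variant with an explicit merge order follows, again on mock data with hypothetical sample names rather than the poster's objects. Instead of relying on the order of the list, it uses fastMNN's merge.order argument to request WT1, WT2, KO1, KO2, so the list itself can stay in whatever order the samples were loaded.

```r
library(SingleCellExperiment)
library(scuttle)
library(batchelor)

set.seed(2)

# Mock stand-ins for the four samples, stored in acquisition order.
samples <- c("WT1", "KO1", "WT2", "KO2")
sce.list <- lapply(samples, function(s) {
    x <- mockSCE(ncells = 100, ngenes = 500)
    colnames(x) <- paste0(s, "_", colnames(x))
    logNormCounts(x)
})
names(sce.list) <- samples

# Equalize depth across the four samples before correction.
sce.list <- multiBatchNorm(sce.list)
names(sce.list) <- samples

# Every sample is its own batch; merge the WT replicates first, then the KOs.
mnn.all <- fastMNN(sce.list, merge.order = c("WT1", "WT2", "KO1", "KO2"))

# Which batches were merged at each step.
S4Vectors::metadata(mnn.all)$merge.info[, c("left", "right")]

table(mnn.all$batch)
```

Comparing this result with the day-as-batch correction, for example by checking whether clusters with similar markers overlap, is one way to run the check suggested above.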