Question

Seurat: merging and integrating



3.6 years ago

francesca3 ▴ 160

Hi everyone. I have a question. I have already read about the difference between the merge and integrate functions in Seurat, but I would like some advice on my specific case. I have samples from two different patients, and for each patient I have a control and a "treated" sample, like this:

                            Control          Treatment 
        Patient A           1Ac              1At 
        Patient B           1Bc              1Bt

The "treated" sample is a subpopulation of the control: the only difference from the control is that the cells were selected for a marker expressed by just a fraction of the total population.

1) I started by analyzing the two patients independently, looking only at how the treated sample clusters relative to its own control. In this case, to combine the two samples (1Ac vs 1At; 1Bc vs 1Bt) I used the "merge" function. The clustering was pretty good for both patients.

2) I then compared the controls from the two patients to check whether the samples were similar (1Ac vs 1Bc). I used "merge", but the two patients clustered separately, apparently having nothing in common. At this point I discovered the "integrate" function, which seems more appropriate when dealing with differences that could be due to natural patient variability. I applied it and the clustering was much better, even though I could still see differences between the two patients in how the cells were distributed among the clusters.

My question is: do I also have to use "integrate" instead of "merge" when I study each patient independently (1Ac vs 1At; 1Bc vs 1Bt) if I decide to present all these data together? Is it acceptable to combine the datasets differently depending on the level of analysis? Of course the clustering changes a bit but I don't think it is necessary to apply a sort of intra-patient batch correction (at least looking at the UMAP).

Sorry for the long post, but I'm just starting to approach single cell analysis and I have a lot to learn.

Thanks

Francesca

single-cell seurat • 6.0k views

updated 3.5 years ago by Igor ▴ 50 • written 3.6 years ago by francesca3 ▴ 160

score 3 · Answer 1 · 2021-11-09

I personally always (and only) integrate the samples; I have probably never used merge in production. The main reason is that, no matter the experimental design, there is always some degree of variability between your samples (different technologies, donors, diseases, sex, age, ...), and integration gives you more control over how you deal with it.

If you want to avoid running the integration when you don't need it (limited time or computational resources), a possible approach would be:

1. Merge the datasets.
2. Re-cluster.
3. Check for any batch effect between the two sources.

If your cells are not clustering by any plausible confounder (sex, experiment, age, patient, ...), then merge was enough; otherwise you should integrate your datasets.
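The merge-first route described above could look roughly like the sketch below. It assumes the Seurat package with its v3/v4-style API; the object names `ctrl` and `trt`, the `sample` metadata column, and the parameter values (30 PCs, resolution 0.5) are placeholders chosen for illustration, not values from this thread. The small `cluster_composition` helper is plain base R and only tabulates what fraction of each cluster comes from each sample.

```r
# Hypothetical helper: merge two Seurat objects and cluster them WITHOUT integration.
# `ctrl` and `trt` are assumed to be Seurat objects holding raw counts.
merge_and_cluster <- function(ctrl, trt, dims = 1:30, resolution = 0.5) {
  stopifnot(requireNamespace("Seurat", quietly = TRUE))
  # Record the sample of origin so it survives the merge.
  ctrl$sample <- "ctrl"
  trt$sample <- "trt"
  obj <- merge(ctrl, y = trt, add.cell.ids = c("ctrl", "trt"))
  # Standard log-normalization workflow on the merged object.
  obj <- Seurat::NormalizeData(obj)
  obj <- Seurat::FindVariableFeatures(obj)
  obj <- Seurat::ScaleData(obj)
  obj <- Seurat::RunPCA(obj)
  obj <- Seurat::FindNeighbors(obj, dims = dims)
  obj <- Seurat::FindClusters(obj, resolution = resolution)
  Seurat::RunUMAP(obj, dims = dims)
}

# Base-R helper: per-cluster proportion of cells from each sample.
# Each row (cluster) sums to 1.
cluster_composition <- function(clusters, samples) {
  counts <- table(cluster = clusters, sample = samples)
  prop.table(counts, margin = 1)
}
```

With a clustered object `obj`, `cluster_composition(obj$seurat_clusters, obj$sample)` gives the table, and `Seurat::DimPlot(obj, group.by = "sample")` shows the same split on the UMAP. Clusters made up almost entirely of one sample are the ones to inspect, keeping in mind that in the design from the question the treated sample is a selected subpopulation, so some imbalance is expected for biological reasons.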

I suggest you look at the Seurat repository on GitHub, as this topic has been asked about multiple times.

score 2 · Answer 2 · 2021-11-22

My workflow looks like this:

1. Merge samples into one Seurat object.
2. Make QC plots, split by sample. A merged object makes it easy to examine different features on one plot.
3. Filter cells and genes, depending on batch structure (see below).
4. Split the merged object into samples again.
5. Integrate.
6. UMAP, clustering, etc.
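As a rough illustration of the workflow above, here is a sketch that assumes the Seurat package with its v3/v4-style anchor-based API (newer releases may offer a different route). The function names `qc_keep` and `integrate_workflow`, the `sample` metadata column, the `^MT-` pattern (human gene symbols), and the thresholds are all assumptions made for the example and should be chosen from your own QC plots. `sample_list` is taken to be a named list of Seurat objects, one per sample.

```r
# Hypothetical QC rule (base R): keep cells with enough detected genes and a
# low mitochondrial percentage. Thresholds are illustrative only.
qc_keep <- function(n_features, percent_mt, min_features = 200, max_mt = 10) {
  n_features > min_features & percent_mt < max_mt
}

integrate_workflow <- function(sample_list, min_features = 200, max_mt = 10,
                               dims = 1:30, resolution = 0.5) {
  stopifnot(requireNamespace("Seurat", quietly = TRUE))
  # Tag each object with its sample name before merging.
  for (nm in names(sample_list)) sample_list[[nm]]$sample <- nm

  # 1. Merge samples into one object.
  merged <- merge(sample_list[[1]], y = sample_list[-1],
                  add.cell.ids = names(sample_list))

  # 2. QC plots, grouped by sample.
  merged[["percent.mt"]] <- Seurat::PercentageFeatureSet(merged, pattern = "^MT-")
  print(Seurat::VlnPlot(merged, group.by = "sample",
                        features = c("nFeature_RNA", "nCount_RNA", "percent.mt")))

  # 3. Filter cells.
  keep <- qc_keep(merged$nFeature_RNA, merged$percent.mt, min_features, max_mt)
  merged <- subset(merged, cells = colnames(merged)[keep])

  # 4. Split back into samples and preprocess each one.
  samples <- Seurat::SplitObject(merged, split.by = "sample")
  samples <- lapply(samples, function(x) {
    x <- Seurat::NormalizeData(x)
    Seurat::FindVariableFeatures(x)
  })

  # 5. Integrate (IntegrateData makes "integrated" the default assay).
  features <- Seurat::SelectIntegrationFeatures(object.list = samples)
  anchors <- Seurat::FindIntegrationAnchors(object.list = samples,
                                            anchor.features = features)
  integrated <- Seurat::IntegrateData(anchorset = anchors)

  # 6. UMAP, clustering, etc. on the integrated assay.
  integrated <- Seurat::ScaleData(integrated)
  integrated <- Seurat::RunPCA(integrated)
  integrated <- Seurat::RunUMAP(integrated, dims = dims)
  integrated <- Seurat::FindNeighbors(integrated, dims = dims)
  Seurat::FindClusters(integrated, resolution = resolution)
}
```

A call might look like `integrate_workflow(list(ctrl = ctrl_obj, trt = trt_obj))`, where `ctrl_obj` and `trt_obj` are placeholder names for your own objects.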

"but I don't think it is necessary to apply a sort of intra-patient batch correction (at least looking at the UMAP)"

Were the cells sorted on the same date, with the same instrument? Did sequencing happen on the same day? What I'm trying to say is: are you sure there are no batch effects between samples? How can you tell the biological and technical differences apart? As for making this decision by looking only at the UMAP, I'm not sure it's a good idea. For example, suppose you see two cell populations that are somewhat close but don't exactly overlap. Do you think these represent technical differences between the samples, or bona fide cell types that happen to be differentially abundant?

That being said, I don't know how well integration works with just two samples.