I have eight 10x datasets (samples). 4 samples facs sorted for cell type a and four samples for cell type b. I want all the data to be analysed together. Just merging them works but I think there are some sample-to-sample technical artifacts. Integrating all of them also seems to be giving strange results. So I decided I would integrate all sample for cell A (dataset x) and separately for cell type B (dataset y) and then merge both datasets.
lst <- SplitObject(sf,split.by="sample")
# integration for cell type a samples
microlist <- lst[1:4]
for (i in 1:length(microlist)) {
microlist[[i]] <- SCTransform(microlist[[i]],return.only.var.genes=F)
}
sf.features <- SelectIntegrationFeatures(microlist,nfeatures=5000)
microlist <- Seurat::PrepSCTIntegration(microlist,anchor.features=sf.features)
sf.anchors <- FindIntegrationAnchors(object.list=microlist,normalization.method="SCT",anchor.features=sf.features)
x <- IntegrateData(anchorset=sf.anchors,normalization.method="SCT")
# integration for cell type b samples
monolist <- lst[5:8]
for (i in 1:length(monolist)) {
monolist[[i]] <- SCTransform(monolist[[i]],return.only.var.genes=F)
}
sf.features <- SelectIntegrationFeatures(monolist,nfeatures=5000)
monolist <- Seurat::PrepSCTIntegration(monolist,anchor.features=sf.features)
sf.anchors <- FindIntegrationAnchors(object.list=monolist,normalization.method="SCT",anchor.features=sf.features)
y <- IntegrateData(anchorset=sf.anchors,normalization.method="SCT")
My two integrated datasets look like this:
> x
An object of class Seurat
36300 features across 18928 samples within 3 assays
Active assay: integrated (5000 features, 5000 variable features)
2 other assays present: RNA, SCT
2 dimensional reductions calculated: pca, tsne
> y
An object of class Seurat
35419 features across 20499 samples within 3 assays
Active assay: integrated (4354 features, 4354 variable features)
2 other assays present: RNA, SCT
2 dimensional reductions calculated: pca, tsne
And then I merge x and y
z <- merge(x,y)
z
An object of class Seurat
39382 features across 39427 samples within 3 assays
Active assay: integrated (7050 features, 0 variable features)
2 other assays present: RNA, SCT
And then I want to run a PCA, but I can't. Note that the number of variable features has become zero after merging. And then I tried to run FindVariableFeatures()
> FindVariableFeatures(z)
Calculating gene variances
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Error in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
invalid 'x'
Running NormalizeData/ScaleData
doesnt make a difference. I am not sure it even makes much sense to run FindVariableFeatures
after scTransform and integration and merging.
What are the thoughts on merging two separately integrated datasets? Are their other tools/methods that actually return corrected raw count values that can be used as the starting point for Seurat?
R 4.0.2
Seurat 3.2.3
Update: Seurat developers have said that they do not support FindVariableFeatures()
on integrated datasets. This is their response:
At the moment we don't support running VariableFeaturePlot on integrated datasets, but you can run this function on individual datasets prior to integration. You can however use SelectIntegrationFeatures to identify features that are consistently variable across datasets. You could also take the union of variable features if you wish. Also, merging two integrated objects is not recommended because you are only normalizing within cell types here.
What strange results was the analysis giving?
Strange as in, it didn't fit well with biology. Normally, we would expect similar cells from different samples (runs) to cluster together and dissimilar cells should cluster distinctly. The unintegrated dataset shows distinct clusters for distinct cell types (which is good), but similar cell types also show some distinct clustering by sample (which is bad). So there is a need for some sort of correction. After full integration with all samples, cells that should be distinct also tend to be merged together (this is bad). Maybe overcorrection? Or unintended consequence of integration?
In addition, unintegrated dataset shows clear sharp expression of diagnostic marker genes in mostly distinct clusters. After integration, the marker genes expression is more fuzzy, blurred and less distinct by cluster.