I have three 10X runs. If I merge them with CreateSeuratObject()
, this is the distribution of genes detected for each run (batch) (nFeatures_RNA).
If I integrate them, using hvg approach or sctransform approach. (Showing sctransform based integration below)
library(sctransform)
su.list <- SplitObject(su,split.by="orig.ident")
for (i in 1:length(su.list)) {
su.list[[i]] <- SCTransform(su.list[[i]],return.only.var.genes=F,vars.to.regress=c("mito_percent","S.Score","G2M.Score"))
}
su.features <- SelectIntegrationFeatures(su.list,nfeatures=3000)
su.list <- Seurat::PrepSCTIntegration(su.list,anchor.features=su.features)
su.anchors <- FindIntegrationAnchors(object.list=su.list,normalization.method="SCT",anchor.features=su.features)
su <- IntegrateData(anchorset=su.anchors,normalization.method="SCT")
I get this gene distribution for each run. Number of genes have now been recalculated based on the 'integrated' assay containing 3000 genes.
Now, is this a batch effect? How would I correct for this? Should I run sctransform()
or NormalizeData()
again using 'vars.to.regress' on 'integrated' assay? Similarily, how would I corrected for mito_percent, cell cycle etc in an integration workflow? Should I regress them before or after integration?
Thanks.
Something is wrong. nFeatures_RNA does not get modified by either
sctransform()
orNormalizeData()
. Additionally, notice that the number of genes drops for each of your batches. You are doing something else besides integration between the first and second plot. You should check that the size of your Seurat object is the same before and after. You should also post the exact commands you used.I also had the feeling something is wrong. Then, I have looked into it and I don't see anything obviously wrong. So, in the first plot, I use nFeatures_RNA, In the second plot, I manually recalculate num of genes detected from 'integrated' assay using
colSums(GetAssayData(su,assay="integrated")>0)
. 'nFeatures_SCT' also does not have this change in distribution since RNA assay and SCT assay both have all the initial set of genes. 'Integrated' only retains 2000 genes by default. I have changed it to 3000 in my case.I suspected something wrong in the sctransform based integration workflow, so I tried the hvg based integration workflow and the integrate assay looks similar in distribution.
So, during the integration step where 3000 cells are selected, somehow it must be picking cells with lower number of genes detected for sample 1 and 2.
colSums(GetAssayData(su,assay="integrated")>0)
does not give you number of detected genes. It gives you the number of detected genes that were used for the integrated analysis. In your case, batch 3 has higher expression of those genes.