Hi, I am a newcomer to single-cell data analysis, and I have a simple question that I believe is a piece of cake for you guys. Briefly, I have two single-cell expression datasets (10X platform) from WT and KO samples. CellRanger estimated that WT has 3000 cells, while KO has 1000 cells, and of course, they got different read numbers per cell as WT is 42K means reads, KO is 115K mean reads per cell. Currently, I am working on these data, and simply following Seurat's guide on their website. My question is how would you handle your data analysis starting from such datasets with 3K cells vs 1K cells? I see that people use very different ways to analyze their scRNA data such as SCT or Lognormalize, Wilcoxon or MAST, SplitObject before SCTransforming or not, etc. I don't know which one is the best practice, which way I can follow but in the end, I am trying to get reproducible results. So, I use Seurat's SCTransfrom assuming SCT removes batch effect (!) from the merged dataset as I have different cell and read numbers. Please see what code I use below, and leave a comment regarding the code if you have any suggestions on it. I would also be glad if you could share any code or GitHub link that I can follow in RStudio for an improved analysis because GitHub currently finds thousands of matches once you type Seurat, 10X data, or single-cell, etc. Best, Volkan
library(Seurat)
library(patchwork)
library(cowplot)
library(ggplot2)
WT.data <- Read10X(data.dir = "~/Desktop/A1-filtered_feature_bc_matrix")
KO.data <- Read10X(data.dir = "~/Desktop/A2-filtered_feature_bc_matrix")
WT <- CreateSeuratObject(counts = WT.data, project = "WT", min.cells = 3, min.features = 200)
WT[["percent.mt"]] <- PercentageFeatureSet(WT, pattern = "^mt-")
VlnPlot(WT, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
WT <- subset(WT, subset = nFeature_RNA > 200 & percent.mt < 5)
KO <- CreateSeuratObject(counts = KO.data, project = "KO", min.cells = 3, min.features = 200)
KO[["percent.mt"]] <- PercentageFeatureSet(KO, pattern = "^mt-")
VlnPlot(KO, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
KO <- subset(KO, subset = nFeature_RNA > 200 & percent.mt < 5)
sample.integrated <- merge(x = WT, y = KO, add.cell.ids = c("WT", "KO"), project = "MyProject")
all.list <- SplitObject(sample.integrated, split.by = "orig.ident")
for (i in names(all.list)) {
all.list[[i]] <- SCTransform(all.list[[i]], verbose = FALSE)
}
sample.features <- SelectIntegrationFeatures(object.list = all.list, nfeatures = 4000)
sample.list <- PrepSCTIntegration(object.list = all.list, anchor.features = sample.features)
sample.anchors <- FindIntegrationAnchors(object.list = sample.list, normalization.method = "SCT",
anchor.features = sample.features)
sample.integrated <- IntegrateData(anchorset = sample.anchors, normalization.method = "SCT")
sample.integrated <- RunPCA(object = sample.integrated, verbose = FALSE)
sample.integrated <- RunUMAP(object = sample.integrated, dims = 1:30)
sample.integrated <- FindNeighbors(sample.integrated, reduction = "pca", dims = 1:20)
sample.integrated <- FindClusters(sample.integrated, resolution = 0.5)
p1 <- DimPlot(sample.integrated, reduction = "umap", group.by = "orig.ident", pt.size = 0.2, cols = c("blue", "pink"))
p2 <- DimPlot(sample.integrated, reduction = "umap", label = T, label.size = 4, repel = T, pt.size = 0.2)
plot_grid(p1, p2)
all.markers <- FindAllMarkers(sample.integrated, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
top10 <- all.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
DoHeatmap(sample.integrated, features = top10$gene) + NoLegend()
DefaultAssay(sample.integrated) <- "RNA"
Conserved_cluster0 <- FindConservedMarkers(sample.integrated, ident.1 = 0, only.pos = TRUE, logfc.threshold = 0.25, grouping.var = "orig.ident")
and so on.
Thank you rpolicastro for your reply, I appreciate it. I saw several codes like what you said. But, I was confused, as its vignette says SCT replaces NormalizeData, ScaleData, and FindVariableFeatures. So that's why I was ignoring additional normalization. Please correct me if I am wrong, you suggest that I need to get back to RNA as default assay, then I need to run NormalizeData to perform further steps like finding markers, FeautrePlotting, or DEG analysis etc right. So, in this case, should I add this normalization step before FindAllMarkers, or before FindConservedMarkers?
I mean is this correct?
Thanks a lot.
You should add it before
FindAllMarkers
.This question is addressed in the Seurat FAQs here about why you should use the counts from the RNA slot.
Thank you rpolicatsro, totally understand what you mean. I'll also reanalyze my data by applying the parameters mentioned on the GitHub link. Best, Volkan