Take my comment with a grain of salt because I am not an active Seurat (but Bioconductor) user, still I recently wondered how exactly Seurat defines markers:
What Seurat does is very simplistic. The FindMarkers()
function simply tests (by default with the Wilcox test) a given cluster versus a union of the cells of all other clusters. Meaning, if you have three clusters it for example tests 1
vs c(2,3)
. For a proof of that see code chunk below. The consequence is that markers per cluster are just enriched in this cluster, but are not specific and there can be other clusters (especially small ones) that have notably higher expression of this gene.
My personal interpretation is that the function kind of guarantees to always find some sort of markers per cluster so analysis can go on, but to me if you really want specific markers you need a different sort of test. The obvious advantage here is that even clusters that do not really have unique gene expression (maybe because of a tight developmental continuum) will get some markers to work with. The disadvantage is that these markers are not unique and might not be sufficient to discriminate the cells versus other clusters, for example using software like UCell or SingleR.
I personally like to test all pairwise comparisons between clusters and then do some sort of filtering, for example "a gene must be overexpressed with these thresholds in my clusters versus all/allButOne/80%/some other clusters". If you just need some markers to get an idea what your cells per cluster belong to in terms of celltype then the FindMarkers()
function might be sufficient. But for definition of transcriptional signatures that really define a group of cells it is (imo) not suitable, especially not if you have many clusters and expression patterns are diverse.
Does that answer your question?
Here the mentioned code chunk to demonstrate that Seurat really just tests one cluster versus the union of the rest:
library(Seurat)
data("pbmc_small")
Idents(pbmc_small)
head(FindMarkers(object = pbmc_small, ident.1 = 0), 5)
Idents(pbmc_small)[Idents(pbmc_small)==2] <- 1
Idents(pbmc_small)
head(FindMarkers(object = pbmc_small, ident.1 = 0), 5)
Created on 2023-12-13 with reprex v2.0.2
Another example:
library(Seurat)
ncol <- 2000
nrow <- 100
nclusters <- 10
set.seed(1)
data <- matrix(rnorm(ncol*nrow, 100, 10), byrow=TRUE, nrow=nrow, ncol=ncol)
colnames(data) <- paste0("sample-", 1:ncol(data))
rownames(data) <- paste0("gene-", 1:nrow(data))
clusters <- sample(1:nclusters, ncol(data), replace = TRUE)
seurat <- CreateSeuratObject(counts = data)
seurat <- NormalizeData(seurat)
Idents(seurat) <- clusters
levels(Idents(seurat))
markers_a <- FindMarkers(seurat, ident.1 = "1", logfc.threshold = 0)
Idents(seurat)[as.numeric(Idents(seurat)) > 1] <- "2"
levels(Idents(seurat))
markers_b <- FindMarkers(seurat, ident.1 = "1", logfc.threshold = 0)
plot(markers_a$avg_log2FC, markers_b$avg_log2FC)
Thanks ATpoint this is very helpful for my understanding. My goal for this analysis is to just characterize the cells that belong to a cluster. My reason for doing it this way than a reference mapping approach is to see if there are top markers from a functional category that are specific to a cluster. I may have to do this using a cell signature approach than just looking at few top markers.