Take my comment with a grain of salt because I am not an active Seurat (but Bioconductor) user, still I recently wondered how exactly Seurat defines markers:
What Seurat does is very simplistic. The FindMarkers()
function simply tests (by default with the Wilcox test) a given cluster versus a union of the cells of all other clusters. Meaning, if you have three clusters it for example tests 1
vs c(2,3)
. For a proof of that see code chunk below. The consequence is that markers per cluster are just enriched in this cluster, but are not specific and there can be other clusters (especially small ones) that have notably higher expression of this gene.
My personal interpretation is that the function kind of guarantees to always find some sort of markers per cluster so analysis can go on, but to me if you really want specific markers you need a different sort of test. The obvious advantage here is that even clusters that do not really have unique gene expression (maybe because of a tight developmental continuum) will get some markers to work with. The disadvantage is that these markers are not unique and might not be sufficient to discriminate the cells versus other clusters, for example using software like UCell or SingleR.
I personally like to test all pairwise comparisons between clusters and then do some sort of filtering, for example "a gene must be overexpressed with these thresholds in my clusters versus all/allButOne/80%/some other clusters". If you just need some markers to get an idea what your cells per cluster belong to in terms of celltype then the FindMarkers()
function might be sufficient. But for definition of transcriptional signatures that really define a group of cells it is (imo) not suitable, especially not if you have many clusters and expression patterns are diverse.
Does that answer your question?
Here the mentioned code chunk to demonstrate that Seurat really just tests one cluster versus the union of the rest:
library(Seurat)
#> Warning: Paket 'Seurat' wurde unter R Version 4.3.1 erstellt
#> The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
#> which was just loaded, will retire in October 2023.
#> Please refer to R-spatial evolution reports for details, especially
#> https://r-spatial.org/r/2023/05/15/evolution4.html.
#> It may be desirable to make the sf package available;
#> package maintainers should consider adding sf to Suggests:.
#> The sp package is now running under evolution status 2
#> (status 2 uses the sf package in place of rgdal)
#> Attaching SeuratObject
data("pbmc_small")
# We have three clusters
Idents(pbmc_small)
#> ATGCCAGAACGACT CATGGCCTGTGCAT GAACCTGATGAACC TGACTGGATTCTCA AGTCAGACTGCACA
#> 0 0 0 0 0
#> TCTGATACACGTGT TGGTATCTAAACAG GCAGCTCTGTTTCT GATATAACACGCAT AATGTTGACAGTCA
#> 0 0 0 0 0
#> AGGTCATGAGTGTC AGAGATGATCTCGC GGGTAACTCTAGTG CATGAGACACGGGA TACGCCACTCCGAA
#> 2 2 2 2 2
#> CTAAACCTGTGCAT GTAAGCACTCATTC TTGGTACTGAATCC CATCATACGGAGCA TACATCACGCTAAC
#> 2 2 2 2 2
#> TTACCATGAATCGC ATAGGAGAAACAGA GCGCACGACTTTAC ACTCGCACGAAAGT ATTACCTGCCTTAT
#> 1 1 1 1 1
#> CCCAACTGCAATCG AAATTCGAATCACG CCATCCGATTCGCC TCCACTCTGAGCTT CATCAGGATGCACA
#> 1 1 1 1 1
#> CTAAACCTCTGACA GATAGAGAAGGGTG CTAACGGAACCGAT AGATATACCCGTAA TACTCTGAATCGAC
#> 0 2 0 0 0
#> GCGCATCTTGCTCC GTTGACGATATCGG ACAGGTACTGGTGT GGCATATGCTTATC CATTACACCAACTG
#> 0 0 0 0 0
#> TAGGGACTGAACTC GCTCCATGAGAAGT TACAATGATGCTAG CTTCATGACCGAAT CTGCCAACAGGAGC
#> 0 2 0 0 2
#> TTGCATTGAGCTAC AAGCAAGAGCTTAG CGGCACGAACTCAG GGTGGAGATTACTC GGCCGATGTACTCT
#> 2 0 0 0 0
#> CGTAGCCTGTATGC TGAGCTGAATGCTG CCTATAACGAGACG ATAAGTTGGTACGT AAGCGACTTTGACG
#> 1 1 2 1 1
#> ACCAGTGAATACCG ATTGCACTTGCTTT CTAGGTGATGGTTG GCACTAGACCTTTA CATGCGCTAGTCAC
#> 1 1 1 1 0
#> TTGAGGACTACGCA ATACCACTCTAAGC CATATAGACTAAGC TTTAGCTGTACTCT GACATTCTCCACCT
#> 2 1 2 1 2
#> ACGTGATGCCATGA ATTGTAGATTCCCG GATAGAGATCACGA AATGCGTGGACGGA GCGTAAACACGGTT
#> 1 1 1 1 2
#> ATTCAGCTCATTGG GGCATATGGGGAGT ATCATCTGACACCA GTCATACTTCGCCT TTACGTACGTTCAG
#> 0 0 0 0 0
#> GAGTTGTGGTAGCT GACGCTCTCTCTCG AGTCTTACTTCGGA GGAACACTTCAGAC CTTGATTGATCTTC
#> 0 0 0 0 1
#> Levels: 0 1 2
# Get markers for "0"
head(FindMarkers(object = pbmc_small, ident.1 = 0), 5)
#> p_val avg_log2FC pct.1 pct.2 p_val_adj
#> HLA-DPB1 9.572778e-13 -5.820829 0.083 0.909 2.201739e-10
#> HLA-DRB1 7.673127e-12 -5.425935 0.083 0.864 1.764819e-09
#> HLA-DPA1 3.673172e-11 -4.374436 0.111 0.864 8.448296e-09
#> HLA-DRA 1.209114e-10 -4.263126 0.417 0.909 2.780962e-08
#> HLA-DRB5 9.547049e-10 -4.356374 0.056 0.773 2.195821e-07
# Change Idents to only have two groups, but results are exactly the same as above, demonstrating that the test goes 0 versus all other cells
Idents(pbmc_small)[Idents(pbmc_small)==2] <- 1
Idents(pbmc_small)
#> ATGCCAGAACGACT CATGGCCTGTGCAT GAACCTGATGAACC TGACTGGATTCTCA AGTCAGACTGCACA
#> 0 0 0 0 0
#> TCTGATACACGTGT TGGTATCTAAACAG GCAGCTCTGTTTCT GATATAACACGCAT AATGTTGACAGTCA
#> 0 0 0 0 0
#> AGGTCATGAGTGTC AGAGATGATCTCGC GGGTAACTCTAGTG CATGAGACACGGGA TACGCCACTCCGAA
#> 1 1 1 1 1
#> CTAAACCTGTGCAT GTAAGCACTCATTC TTGGTACTGAATCC CATCATACGGAGCA TACATCACGCTAAC
#> 1 1 1 1 1
#> TTACCATGAATCGC ATAGGAGAAACAGA GCGCACGACTTTAC ACTCGCACGAAAGT ATTACCTGCCTTAT
#> 1 1 1 1 1
#> CCCAACTGCAATCG AAATTCGAATCACG CCATCCGATTCGCC TCCACTCTGAGCTT CATCAGGATGCACA
#> 1 1 1 1 1
#> CTAAACCTCTGACA GATAGAGAAGGGTG CTAACGGAACCGAT AGATATACCCGTAA TACTCTGAATCGAC
#> 0 1 0 0 0
#> GCGCATCTTGCTCC GTTGACGATATCGG ACAGGTACTGGTGT GGCATATGCTTATC CATTACACCAACTG
#> 0 0 0 0 0
#> TAGGGACTGAACTC GCTCCATGAGAAGT TACAATGATGCTAG CTTCATGACCGAAT CTGCCAACAGGAGC
#> 0 1 0 0 1
#> TTGCATTGAGCTAC AAGCAAGAGCTTAG CGGCACGAACTCAG GGTGGAGATTACTC GGCCGATGTACTCT
#> 1 0 0 0 0
#> CGTAGCCTGTATGC TGAGCTGAATGCTG CCTATAACGAGACG ATAAGTTGGTACGT AAGCGACTTTGACG
#> 1 1 1 1 1
#> ACCAGTGAATACCG ATTGCACTTGCTTT CTAGGTGATGGTTG GCACTAGACCTTTA CATGCGCTAGTCAC
#> 1 1 1 1 0
#> TTGAGGACTACGCA ATACCACTCTAAGC CATATAGACTAAGC TTTAGCTGTACTCT GACATTCTCCACCT
#> 1 1 1 1 1
#> ACGTGATGCCATGA ATTGTAGATTCCCG GATAGAGATCACGA AATGCGTGGACGGA GCGTAAACACGGTT
#> 1 1 1 1 1
#> ATTCAGCTCATTGG GGCATATGGGGAGT ATCATCTGACACCA GTCATACTTCGCCT TTACGTACGTTCAG
#> 0 0 0 0 0
#> GAGTTGTGGTAGCT GACGCTCTCTCTCG AGTCTTACTTCGGA GGAACACTTCAGAC CTTGATTGATCTTC
#> 0 0 0 0 1
#> Levels: 0 1
head(FindMarkers(object = pbmc_small, ident.1 = 0), 5)
#> p_val avg_log2FC pct.1 pct.2 p_val_adj
#> HLA-DPB1 9.572778e-13 -5.820829 0.083 0.909 2.201739e-10
#> HLA-DRB1 7.673127e-12 -5.425935 0.083 0.864 1.764819e-09
#> HLA-DPA1 3.673172e-11 -4.374436 0.111 0.864 8.448296e-09
#> HLA-DRA 1.209114e-10 -4.263126 0.417 0.909 2.780962e-08
#> HLA-DRB5 9.547049e-10 -4.356374 0.056 0.773 2.195821e-07
Created on 2023-12-13 with reprex v2.0.2
Another example:
library(Seurat)
# Some mock data with 10 clusters
ncol <- 2000
nrow <- 100
nclusters <- 10
set.seed(1)
data <- matrix(rnorm(ncol*nrow, 100, 10), byrow=TRUE, nrow=nrow, ncol=ncol)
colnames(data) <- paste0("sample-", 1:ncol(data))
rownames(data) <- paste0("gene-", 1:nrow(data))
clusters <- sample(1:nclusters, ncol(data), replace = TRUE)
seurat <- CreateSeuratObject(counts = data)
seurat <- NormalizeData(seurat)
Idents(seurat) <- clusters
levels(Idents(seurat))
markers_a <- FindMarkers(seurat, ident.1 = "1", logfc.threshold = 0)
# Make it two clusters (1 and 2)
Idents(seurat)[as.numeric(Idents(seurat)) > 1] <- "2"
levels(Idents(seurat))
markers_b <- FindMarkers(seurat, ident.1 = "1", logfc.threshold = 0)
# Perfect agreement
plot(markers_a$avg_log2FC, markers_b$avg_log2FC)
Thanks ATpoint this is very helpful for my understanding. My goal for this analysis is to just characterize the cells that belong to a cluster. My reason for doing it this way than a reference mapping approach is to see if there are top markers from a functional category that are specific to a cluster. I may have to do this using a cell signature approach than just looking at few top markers.