Question

Confusing results from FindConservedMarkers in Seurat

0

Entering edit mode

12 months ago

pm2012 ▴ 140

I am trying to identify marker genes from clusters identified by running Seurat. I use FindConservedMarkers for this as I have multiple conditions in my dataset and I use grouping.var as my multiple conditions. However, when I look at the top 5 marker genes based on Avglog2FC, there are a number of markers which shows up as markers of multiple clusters. I don't know how to explain these results. Has anyone come across similar results? How do you interpret/make sense of this?

Here's my experiment design:

5 control samples: 2 time points
6 treated samples: 3 time points

I have integrated all the samples to identify clusters.

10XGenomics Seurat • 2.6k views

ADD COMMENT • link 12 months ago by pm2012 ▴ 140

1

Entering edit mode

12 months ago

yura.grabovska ▴ 690

By definition FindConservedMarkers() "Finds markers that are conserved between the groups".

If you want to test for differential expression to find gene markers defining a cluster you want to use FindMarkers() / FindAllMarkers()

ADD COMMENT • link 12 months ago by yura.grabovska ▴ 690

0

Entering edit mode

Thanks. My understanding is FindConservedMarkers() more appropriate if you have multiple groups/conditions and you want to ensure the markers are conserved across conditions within a cluster. This is based on the reference thread below. FindConservedMarkers vs FindMarkers vs FindAllMarkers Seurat

ADD REPLY • link 12 months ago by pm2012 ▴ 140

score 5 · Accepted Answer · 2023-12-13

Take my comment with a grain of salt because I am not an active Seurat (but Bioconductor) user, still I recently wondered how exactly Seurat defines markers:

What Seurat does is very simplistic. The FindMarkers() function simply tests (by default with the Wilcox test) a given cluster versus a union of the cells of all other clusters. Meaning, if you have three clusters it for example tests 1 vs c(2,3). For a proof of that see code chunk below. The consequence is that markers per cluster are just enriched in this cluster, but are not specific and there can be other clusters (especially small ones) that have notably higher expression of this gene.

My personal interpretation is that the function kind of guarantees to always find some sort of markers per cluster so analysis can go on, but to me if you really want specific markers you need a different sort of test. The obvious advantage here is that even clusters that do not really have unique gene expression (maybe because of a tight developmental continuum) will get some markers to work with. The disadvantage is that these markers are not unique and might not be sufficient to discriminate the cells versus other clusters, for example using software like UCell or SingleR.

I personally like to test all pairwise comparisons between clusters and then do some sort of filtering, for example "a gene must be overexpressed with these thresholds in my clusters versus all/allButOne/80%/some other clusters". If you just need some markers to get an idea what your cells per cluster belong to in terms of celltype then the FindMarkers() function might be sufficient. But for definition of transcriptional signatures that really define a group of cells it is (imo) not suitable, especially not if you have many clusters and expression patterns are diverse.

Does that answer your question?

Here the mentioned code chunk to demonstrate that Seurat really just tests one cluster versus the union of the rest:

library(Seurat)
#> Warning: Paket 'Seurat' wurde unter R Version 4.3.1 erstellt
#> The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
#> which was just loaded, will retire in October 2023.
#> Please refer to R-spatial evolution reports for details, especially
#> https://r-spatial.org/r/2023/05/15/evolution4.html.
#> It may be desirable to make the sf package available;
#> package maintainers should consider adding sf to Suggests:.
#> The sp package is now running under evolution status 2
#>      (status 2 uses the sf package in place of rgdal)
#> Attaching SeuratObject

data("pbmc_small")

# We have three clusters
Idents(pbmc_small)
#> ATGCCAGAACGACT CATGGCCTGTGCAT GAACCTGATGAACC TGACTGGATTCTCA AGTCAGACTGCACA 
#>              0              0              0              0              0 
#> TCTGATACACGTGT TGGTATCTAAACAG GCAGCTCTGTTTCT GATATAACACGCAT AATGTTGACAGTCA 
#>              0              0              0              0              0 
#> AGGTCATGAGTGTC AGAGATGATCTCGC GGGTAACTCTAGTG CATGAGACACGGGA TACGCCACTCCGAA 
#>              2              2              2              2              2 
#> CTAAACCTGTGCAT GTAAGCACTCATTC TTGGTACTGAATCC CATCATACGGAGCA TACATCACGCTAAC 
#>              2              2              2              2              2 
#> TTACCATGAATCGC ATAGGAGAAACAGA GCGCACGACTTTAC ACTCGCACGAAAGT ATTACCTGCCTTAT 
#>              1              1              1              1              1 
#> CCCAACTGCAATCG AAATTCGAATCACG CCATCCGATTCGCC TCCACTCTGAGCTT CATCAGGATGCACA 
#>              1              1              1              1              1 
#> CTAAACCTCTGACA GATAGAGAAGGGTG CTAACGGAACCGAT AGATATACCCGTAA TACTCTGAATCGAC 
#>              0              2              0              0              0 
#> GCGCATCTTGCTCC GTTGACGATATCGG ACAGGTACTGGTGT GGCATATGCTTATC CATTACACCAACTG 
#>              0              0              0              0              0 
#> TAGGGACTGAACTC GCTCCATGAGAAGT TACAATGATGCTAG CTTCATGACCGAAT CTGCCAACAGGAGC 
#>              0              2              0              0              2 
#> TTGCATTGAGCTAC AAGCAAGAGCTTAG CGGCACGAACTCAG GGTGGAGATTACTC GGCCGATGTACTCT 
#>              2              0              0              0              0 
#> CGTAGCCTGTATGC TGAGCTGAATGCTG CCTATAACGAGACG ATAAGTTGGTACGT AAGCGACTTTGACG 
#>              1              1              2              1              1 
#> ACCAGTGAATACCG ATTGCACTTGCTTT CTAGGTGATGGTTG GCACTAGACCTTTA CATGCGCTAGTCAC 
#>              1              1              1              1              0 
#> TTGAGGACTACGCA ATACCACTCTAAGC CATATAGACTAAGC TTTAGCTGTACTCT GACATTCTCCACCT 
#>              2              1              2              1              2 
#> ACGTGATGCCATGA ATTGTAGATTCCCG GATAGAGATCACGA AATGCGTGGACGGA GCGTAAACACGGTT 
#>              1              1              1              1              2 
#> ATTCAGCTCATTGG GGCATATGGGGAGT ATCATCTGACACCA GTCATACTTCGCCT TTACGTACGTTCAG 
#>              0              0              0              0              0 
#> GAGTTGTGGTAGCT GACGCTCTCTCTCG AGTCTTACTTCGGA GGAACACTTCAGAC CTTGATTGATCTTC 
#>              0              0              0              0              1 
#> Levels: 0 1 2

# Get markers for "0"
head(FindMarkers(object = pbmc_small, ident.1 = 0), 5)
#>                 p_val avg_log2FC pct.1 pct.2    p_val_adj
#> HLA-DPB1 9.572778e-13  -5.820829 0.083 0.909 2.201739e-10
#> HLA-DRB1 7.673127e-12  -5.425935 0.083 0.864 1.764819e-09
#> HLA-DPA1 3.673172e-11  -4.374436 0.111 0.864 8.448296e-09
#> HLA-DRA  1.209114e-10  -4.263126 0.417 0.909 2.780962e-08
#> HLA-DRB5 9.547049e-10  -4.356374 0.056 0.773 2.195821e-07

# Change Idents to only have two groups, but results are exactly the same as above, demonstrating that the test goes 0 versus all other cells
Idents(pbmc_small)[Idents(pbmc_small)==2] <- 1
Idents(pbmc_small)
#> ATGCCAGAACGACT CATGGCCTGTGCAT GAACCTGATGAACC TGACTGGATTCTCA AGTCAGACTGCACA 
#>              0              0              0              0              0 
#> TCTGATACACGTGT TGGTATCTAAACAG GCAGCTCTGTTTCT GATATAACACGCAT AATGTTGACAGTCA 
#>              0              0              0              0              0 
#> AGGTCATGAGTGTC AGAGATGATCTCGC GGGTAACTCTAGTG CATGAGACACGGGA TACGCCACTCCGAA 
#>              1              1              1              1              1 
#> CTAAACCTGTGCAT GTAAGCACTCATTC TTGGTACTGAATCC CATCATACGGAGCA TACATCACGCTAAC 
#>              1              1              1              1              1 
#> TTACCATGAATCGC ATAGGAGAAACAGA GCGCACGACTTTAC ACTCGCACGAAAGT ATTACCTGCCTTAT 
#>              1              1              1              1              1 
#> CCCAACTGCAATCG AAATTCGAATCACG CCATCCGATTCGCC TCCACTCTGAGCTT CATCAGGATGCACA 
#>              1              1              1              1              1 
#> CTAAACCTCTGACA GATAGAGAAGGGTG CTAACGGAACCGAT AGATATACCCGTAA TACTCTGAATCGAC 
#>              0              1              0              0              0 
#> GCGCATCTTGCTCC GTTGACGATATCGG ACAGGTACTGGTGT GGCATATGCTTATC CATTACACCAACTG 
#>              0              0              0              0              0 
#> TAGGGACTGAACTC GCTCCATGAGAAGT TACAATGATGCTAG CTTCATGACCGAAT CTGCCAACAGGAGC 
#>              0              1              0              0              1 
#> TTGCATTGAGCTAC AAGCAAGAGCTTAG CGGCACGAACTCAG GGTGGAGATTACTC GGCCGATGTACTCT 
#>              1              0              0              0              0 
#> CGTAGCCTGTATGC TGAGCTGAATGCTG CCTATAACGAGACG ATAAGTTGGTACGT AAGCGACTTTGACG 
#>              1              1              1              1              1 
#> ACCAGTGAATACCG ATTGCACTTGCTTT CTAGGTGATGGTTG GCACTAGACCTTTA CATGCGCTAGTCAC 
#>              1              1              1              1              0 
#> TTGAGGACTACGCA ATACCACTCTAAGC CATATAGACTAAGC TTTAGCTGTACTCT GACATTCTCCACCT 
#>              1              1              1              1              1 
#> ACGTGATGCCATGA ATTGTAGATTCCCG GATAGAGATCACGA AATGCGTGGACGGA GCGTAAACACGGTT 
#>              1              1              1              1              1 
#> ATTCAGCTCATTGG GGCATATGGGGAGT ATCATCTGACACCA GTCATACTTCGCCT TTACGTACGTTCAG 
#>              0              0              0              0              0 
#> GAGTTGTGGTAGCT GACGCTCTCTCTCG AGTCTTACTTCGGA GGAACACTTCAGAC CTTGATTGATCTTC 
#>              0              0              0              0              1 
#> Levels: 0 1

head(FindMarkers(object = pbmc_small, ident.1 = 0), 5)
#>                 p_val avg_log2FC pct.1 pct.2    p_val_adj
#> HLA-DPB1 9.572778e-13  -5.820829 0.083 0.909 2.201739e-10
#> HLA-DRB1 7.673127e-12  -5.425935 0.083 0.864 1.764819e-09
#> HLA-DPA1 3.673172e-11  -4.374436 0.111 0.864 8.448296e-09
#> HLA-DRA  1.209114e-10  -4.263126 0.417 0.909 2.780962e-08
#> HLA-DRB5 9.547049e-10  -4.356374 0.056 0.773 2.195821e-07
Created on 2023-12-13 with reprex v2.0.2

Another example:

library(Seurat)

# Some mock data with 10 clusters
ncol <- 2000
nrow <- 100
nclusters <- 10

set.seed(1)
data <- matrix(rnorm(ncol*nrow, 100, 10), byrow=TRUE, nrow=nrow, ncol=ncol)
colnames(data) <- paste0("sample-", 1:ncol(data))
rownames(data) <- paste0("gene-", 1:nrow(data))
clusters <- sample(1:nclusters, ncol(data), replace = TRUE)

seurat <- CreateSeuratObject(counts = data)
seurat <- NormalizeData(seurat)
Idents(seurat) <- clusters
levels(Idents(seurat))

markers_a <- FindMarkers(seurat, ident.1 = "1", logfc.threshold = 0)

# Make it two clusters (1 and 2)
Idents(seurat)[as.numeric(Idents(seurat)) > 1] <- "2"
levels(Idents(seurat))
markers_b <- FindMarkers(seurat, ident.1 = "1", logfc.threshold = 0)

# Perfect agreement
plot(markers_a$avg_log2FC, markers_b$avg_log2FC)