Question

Cluster annotation in single cell

0

2.2 years ago

synat.keam ▴ 120

Dear Fellows,

In single-cell analysis, once we perform clustering, for example with "umap", we get X clusters. The next step is to annotate each cluster, which can be done by looking at the differentially expressed genes (DEGs) within each cluster. If we get DEGs within each cluster, are these DEGs the result of comparing multiple cells? I remember that in bulk RNA-seq we can only compare two groups at a time using a contrast. I am not sure how the comparison is done to get DEGs among the several hundred cells in a cluster in a single-cell experiment.

Also, when integrating large datasets, the main purpose is batch correction etc. In the end, we get a single UMAP plot, which is the result of integrating all samples and conditions (control/treatment etc.) from all groups. Does displaying a single UMAP mean that these cell clusters are found across samples and conditions? How could I tell from a single UMAP that a given group has fewer, for instance, fibroblasts or T cells, given that I have clusters of fibroblasts or T cells? What is the point of displaying a single UMAP of the whole dataset (I normally see this in publications)? Sorry, I am just very confused... Looking forward to hearing from you all.

Thanks,

Single-cell • 3.5k views

ADD COMMENT • link updated 7 months ago by Ethan • 0 • written 2.2 years ago by synat.keam ▴ 120

2

You need to go through these tutorials, which will help you a lot.

Single-cell best practices

OSCA

ADD REPLY • link 2.2 years ago by bk11 ★ 3.1k

1

I'd really recommend finding a local scRNA-seq expert to talk to at your institution if available. These questions are really beyond the scope of this site and will require lengthy and detailed answers.

ADD REPLY • link 2.2 years ago by jared.andrews07 ★ 19k

0

Are you using Seurat?

ADD REPLY • link 2.2 years ago by Ram 45k

0

Thanks, I'm using Seurat and have also tried to learn from the Bioconductor book. Could you help explain this to me? I am just very confused and have not made any progress.

Regards,

ADD REPLY • link 2.2 years ago by synat.keam ▴ 120

0

I asked that question because it's relevant to your post. I cannot guide you on such a broad topic. Use the links bk11 has provided you to learn more.

ADD REPLY • link 2.2 years ago by Ram 45k

score 3 · Answer 1 · 2023-10-17

Clustering (for example in Seurat's pipeline) is usually done on the PCA embedding, not on UMAP: the former largely preserves the Euclidean distances between cells in the multidimensional expression space, while the latter is somewhat stochastic by definition and is mainly used for visualization.
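To make the "cluster on PCA, not on UMAP" point concrete, here is a minimal base-R sketch on simulated data. Everything in it is an assumption for illustration: the toy matrix, the two populations "A" and "B", and the use of k-means as a simple stand-in for the graph-based clustering that Seurat actually performs on the principal components.

```r
set.seed(1)

# Toy log-expression matrix: 60 cells (rows) x 20 genes (columns),
# drawn from two hypothetical cell populations "A" and "B"
n_cells <- 60
n_genes <- 20
truth <- rep(c("A", "B"), each = n_cells / 2)
expr <- matrix(rnorm(n_cells * n_genes), nrow = n_cells)

# Population B over-expresses the first 5 genes
expr[truth == "B", 1:5] <- expr[truth == "B", 1:5] + 4

# Step 1: PCA on the centered and scaled expression matrix
pca <- prcomp(expr, center = TRUE, scale. = TRUE)
pcs <- pca$x[, 1:5]  # keep the first 5 principal components

# Step 2: cluster the cells in PC space
# (k-means here is only a stand-in for graph-based clustering)
clusters <- kmeans(pcs, centers = 2, nstart = 10)$cluster

# Compare the clusters with the simulated populations
print(table(cluster = clusters, truth = truth))
```

A UMAP would then be computed from the same principal components only to draw the cells in two dimensions; the cluster labels themselves come from the PC space.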

The DEGs can be found with a very nice package called presto. As an added benefit, it doesn't assume any distribution for your data, as it uses nonparametric (i.e. rank-based) statistical testing.

score 0 · Answer 2 · 2025-04-26

I'd like to recommend a new tool called mLLMCelltype that can greatly simplify the cluster annotation process for single-cell RNA-seq data.

mLLMCelltype is a cell type annotation framework based on large language models (LLMs) that leverages the collective intelligence of multiple LLMs (such as Claude 3.7, GPT-4o, Gemini 2.5 Pro, etc.) to provide accurate cell type annotations without requiring you to manually analyze differentially expressed genes for each cluster.

Why mLLMCelltype Solves Your Problems

Automated Annotation Process: You don't need to manually analyze DEGs for each cluster; mLLMCelltype handles this process automatically.
Multi-model Consensus Mechanism: By leveraging the collective intelligence of multiple LLMs, it reduces biases and hallucinations from single models, improving annotation accuracy.
Transparent Uncertainty Quantification: Provides quantitative metrics (Consensus Proportion and Shannon Entropy) to help identify ambiguous cell populations that may require expert review.
No Reference Dataset Required: Works without pre-training or reference data, directly annotating based on differentially expressed genes.
Complete Reasoning Chains: Documents the full deliberation process for transparent decision-making.
Seamless Integration with Seurat: Works directly with your existing Seurat workflows.

Usage Example

library(mLLMCelltype)
library(Seurat)
library(dplyr)
library(ggplot2)  # needed for ggtitle() in the plotting step

# Assuming you already have a preprocessed Seurat object
# pbmc <- readRDS("your_seurat_object.rds")

# Find marker genes for each cluster
pbmc_markers <- FindAllMarkers(pbmc,
                           only.pos = TRUE,
                           min.pct = 0.25,
                           logfc.threshold = 0.25)

# Set up cache directory to speed up processing
cache_dir <- "./mllmcelltype_cache"
dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)

# Run mLLMCelltype annotation with multiple LLM models
consensus_results <- interactive_consensus_annotation(
  input = pbmc_markers,
  tissue_name = "human PBMC",  # provide tissue context
  models = c(
    "claude-3-7-sonnet-20250219",  # Anthropic
    "gpt-4o",                   # OpenAI
    "gemini-2.5-pro"            # Google
  ),
  api_keys = list(
    anthropic = Sys.getenv("ANTHROPIC_API_KEY"),
    openai = Sys.getenv("OPENAI_API_KEY"),
    gemini = Sys.getenv("GOOGLE_API_KEY")
  ),
  top_gene_count = 10,
  controversy_threshold = 1.0,
  entropy_threshold = 1.0,
  cache_dir = cache_dir
)

# Add annotations to Seurat object
cluster_to_celltype_map <- consensus_results$final_annotations

# Create new cell type identifier column
cell_types <- as.character(Idents(pbmc))
for (cluster_id in names(cluster_to_celltype_map)) {
  cell_types[cell_types == cluster_id] <- cluster_to_celltype_map[[cluster_id]]
}

# Add cell types to Seurat object
pbmc$mLLM_cell_type <- cell_types

# Visualize results
DimPlot(pbmc, group.by = "mLLM_cell_type", label = TRUE) +
  ggtitle("mLLMCelltype Consensus Annotations")

Regarding Your UMAP Integration Questions

After annotation with mLLMCelltype, you can analyze the results further:

You can use DimPlot(pbmc, group.by = "mLLM_cell_type", split.by = "condition") to view cell type distributions across different conditions.
Use table(pbmc$mLLM_cell_type, pbmc$sample) or table(pbmc$mLLM_cell_type, pbmc$condition) to quantify the number of each cell type across different samples or conditions.
mLLMCelltype's uncertainty quantification features can help you identify cell populations that might differ between batches or conditions.
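As a rough illustration of the table() suggestion above, the sketch below uses a small made-up metadata table rather than a real Seurat object. The column names cell_type and condition and all cell counts are hypothetical. Because the number of cells captured usually differs between samples, it compares proportions within each condition rather than raw counts.

```r
# Hypothetical per-cell metadata, as it might look after annotation
meta <- data.frame(
  cell_type = c(rep("Fibroblast", 40), rep("T cell", 60),   # control cells
                rep("Fibroblast", 10), rep("T cell", 90)),  # treatment cells
  condition = rep(c("control", "treatment"), each = 100)
)

# Raw number of cells per cell type and condition
counts <- table(meta$cell_type, meta$condition)

# Fraction of each cell type within each condition (columns sum to 1)
props <- prop.table(counts, margin = 2)

print(counts)
print(round(props, 2))
```

With a real Seurat object, the same two calls could be applied to the corresponding metadata columns. This is the kind of summary that answers "does this group have fewer fibroblasts?", which a single combined UMAP cannot show on its own.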

I hope this tool helps solve your single-cell annotation challenges!

score 0 · Answer 3 · 2025-04-28

In single-cell RNA-seq, after clustering, when you run DEG analysis for a cluster, you typically compare the cells in that cluster against all other cells (or sometimes against specific clusters). So yes, the DEGs you obtain are based on comparing many cells, for example, comparing the hundreds of cells in Cluster 1 to all cells not in Cluster 1.

The statistical tests are adapted for single-cell data and account for variability across cells, often using Wilcoxon rank-sum tests, negative binomial models, or other single-cell-specific methods (like those available in Seurat or Scanpy).
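The one-vs-rest idea can be sketched in base R as follows. This is a simplified illustration on simulated counts for a single gene, not what Seurat or Scanpy run internally. The cluster names, the Poisson rates, and the one-sided alternative (testing only for higher expression in the cluster) are assumptions made for the example.

```r
set.seed(42)

# Toy data: 300 cells in three hypothetical clusters, one gene
cluster <- rep(c("c1", "c2", "c3"), each = 100)

# The gene is simulated to be more highly expressed in c1 only
expr <- rpois(300, lambda = ifelse(cluster == "c1", 8, 2))

# For each cluster, compare its cells against all remaining cells.
# Each test is still a two-group comparison; the cells are the observations.
one_vs_rest <- sapply(unique(cluster), function(k) {
  in_k <- cluster == k
  wilcox.test(expr[in_k], expr[!in_k],
              alternative = "greater",  # is expression higher in cluster k?
              exact = FALSE)$p.value
})

print(signif(one_vs_rest, 3))
```

So, as with a contrast in bulk RNA-seq, every comparison has two groups. The difference is that each group contains many cells, and the test is repeated for every gene and every cluster, followed by multiple-testing correction.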