Question

Finding a high number of significant genes with FindMarkers in Seurat

22 days ago

Rosa Icela • 0

Hello,

I am new to analyzing single-cell data. I am currently working on a two-group comparison with 4 biological replicates each. I followed the standard protocol: NormalizeData, FindVariableFeatures (2000), FindIntegrationAnchors, IntegrateData, ScaleData, RunPCA, RunUMAP, FindNeighbors, and FindClusters. After defining my clusters, I ran FindMarkers with default settings. The output is where I find myself lost. I filtered by average log2FC > 0.25 or < -0.25, min.pct of 0.25, and padj < 0.00001, and I still have 14000 genes that are significantly different between groups. How should I approach that number? Is it usual to see such a high number? Am I missing something in my analysis? I would appreciate any suggestions and advice on this. Thank you.

FindMarkers Single-cell Seurat • 305 views

ADD COMMENT • link updated 22 days ago by jared.andrews07 ★ 18k • written 22 days ago by Rosa Icela • 0

score 0 · Answer 1 · 2025-01-29

22 days ago

jared.andrews07 ★ 18k

The problem of p-value deflation in scRNA-seq DE methods has been well documented. By treating each cell as a biological replicate, naive single-cell DE methods give you minuscule p-values despite negligible effect sizes. But cells are not independent replicates: they are correlated within each individual (total) sample. This results in very high false positive rates for most of the single-cell-specific DE methods.
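To make the replicate issue concrete, here is a toy simulation in plain Python (not Seurat code, and not how FindMarkers works internally). It assumes a hypothetical gene with no true group effect, 4 samples per group as in the question, a random per-sample shift, and 200 cells per sample. All numbers and names are made up for illustration.

```python
import random
import statistics

def welch_t(a, b):
    """Welch t statistic for two lists of numbers."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se

random.seed(1)
CELLS_PER_SAMPLE = 200  # hypothetical cluster size per sample

# (group, sample index) -> per-cell expression values for one gene.
# Every sample gets its own random shift; there is NO group effect.
samples = {}
for group in ("control", "treated"):
    for i in range(4):
        sample_shift = random.gauss(0.0, 0.5)
        samples[(group, i)] = [
            random.gauss(sample_shift, 1.0) for _ in range(CELLS_PER_SAMPLE)
        ]

def cells(group):
    """All cells of a group pooled together (cells treated as replicates)."""
    return [v for (g, _), vals in samples.items() if g == group for v in vals]

def sample_means(group):
    """One value per sample (samples treated as replicates)."""
    return [statistics.mean(vals) for (g, _), vals in samples.items() if g == group]

t_cell = welch_t(cells("control"), cells("treated"))
t_sample = welch_t(sample_means("control"), sample_means("treated"))
print(f"t with cells as replicates:   {t_cell:.2f}")
print(f"t with samples as replicates: {t_sample:.2f}")
```

Both statistics share the same difference in means. The cell-level one divides by a much smaller standard error because it pretends to have 800 independent observations per group instead of 4, which is where the tiny p-values come from.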

As such, pseudobulking single-cell populations by cell type/cluster per individual and comparing between conditions has been shown to perform better, and it additionally allows batch effects between individuals to be handled.
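The aggregation step itself is simple. A minimal sketch in plain Python, assuming each cell comes with a sample label, a cluster label, and raw (unnormalized) counts; the function name `pseudobulk` and the toy data are hypothetical, and in practice you would do this with the tooling shown in the books linked below:

```python
from collections import defaultdict

def pseudobulk(cells):
    """Sum raw counts over all cells sharing a (sample, cluster) pair.

    `cells` is an iterable of (sample, cluster, {gene: count}) tuples.
    Returns {(sample, cluster): {gene: summed count}}.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for sample, cluster, counts in cells:
        for gene, n in counts.items():
            totals[(sample, cluster)][gene] += n
    return {key: dict(genes) for key, genes in totals.items()}

# Hypothetical toy data: four cells from two samples and two clusters.
toy_cells = [
    ("ctrl_1", "c0", {"GeneA": 3, "GeneB": 0}),
    ("ctrl_1", "c0", {"GeneA": 1, "GeneB": 2}),
    ("ctrl_1", "c1", {"GeneA": 0, "GeneB": 5}),
    ("trt_1", "c0", {"GeneA": 7, "GeneB": 1}),
]

bulk = pseudobulk(toy_cells)
for key, genes in sorted(bulk.items()):
    print(key, genes)
```

Each (sample, cluster) pair becomes one column of a count matrix, so within a cluster you compare 4 vs 4 pseudobulk profiles with a bulk tool rather than thousands of cells against each other.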

This section of the Single-Cell Best Practices book has a nice overview with lots of references and some examples. The OSCA book also has a section on this with examples that I find easier to follow, as it doesn't mix R and Python.

So to answer your question, yes, this is a common problem, and you should adjust your analysis to use more robust approaches for DE.

ADD COMMENT • link 22 days ago by jared.andrews07 ★ 18k

Thank you for the useful information, I will dig deeper into it.

I've tried pseudobulk analysis using DESeq2, and one of my clusters has only 44 DEGs (which seems fine), but the other has only 1 at padj < 0.05. The adjusted p-values increase rapidly to 0.998. I do take the batch into consideration, but that only seems to increase the number of genes by 5-10. Do you have any suggestions? Thank you again for your input.

ADD REPLY • link 22 days ago by Rosa Icela • 0

Nothing concrete, unfortunately. You can try some other methods (limma, edgeR), but you'll likely get pretty similar results. I'd try dropping the significance threshold to 0.1.
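For intuition on what the threshold change does, here is a rough sketch of the Benjamini-Hochberg adjustment (DESeq2's default method) in plain Python. It ignores DESeq2's independent filtering, and the p-values are made up:

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that each adjusted value
    # never exceeds the adjusted value of any larger p-value.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

toy_p = [0.0001, 0.004, 0.03, 0.2, 0.6, 0.9]  # hypothetical raw p-values
padj = bh_adjust(toy_p)
hits_05 = sum(p < 0.05 for p in padj)
hits_10 = sum(p < 0.10 for p in padj)
print(padj)
print(f"padj < 0.05: {hits_05} genes, padj < 0.10: {hits_10} genes")
```

Each raw p-value is scaled by (number of tests / rank), so with thousands of genes and few strong signals the adjusted values climb toward 1 quickly. A looser cutoff only rescues genes sitting just above the old one.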

I have also noticed that very small clusters tend to have poor sensitivity, likely due to higher variability among the pseudobulks. So for clusters with <20 cells per sample, you may have a tough time.
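A quick way to check for this before running DE is to count cells per sample within each cluster. A minimal sketch in plain Python, where the cutoff of 20 is just my rule of thumb from above and the labels are hypothetical:

```python
from collections import Counter

MIN_CELLS = 20  # rule of thumb, not a hard cutoff

def low_count_groups(labels, min_cells=MIN_CELLS):
    """Return {(sample, cluster): n_cells} for pairs with fewer than min_cells cells.

    `labels` holds one (sample, cluster) tuple per cell. Pairs with zero
    cells never appear in the input, so they are not reported here.
    """
    counts = Counter(labels)
    return {key: n for key, n in counts.items() if n < min_cells}

# Hypothetical per-cell labels for two samples and two clusters.
toy_labels = (
    [("ctrl_1", "c0")] * 150
    + [("ctrl_1", "c1")] * 12
    + [("trt_1", "c0")] * 90
    + [("trt_1", "c1")] * 25
)

flagged = low_count_groups(toy_labels)
print(flagged)
```

If most samples in a cluster get flagged, a low DEG count there probably says more about noisy pseudobulks than about the biology.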

ADD REPLY • link 22 days ago by jared.andrews07 ★ 18k