Finding a high number of significant genes with FindMarkers in Seurat
22 days ago
Rosa Icela • 0

Hello,

I am new to analyzing single-cell data. I am currently working on a two-group comparison with 4 biological replicates each. I followed the standard protocol: NormalizeData, FindVariableFeatures (2000 features), FindIntegrationAnchors, IntegrateData, ScaleData, RunPCA, RunUMAP, FindNeighbors, and FindClusters. After defining my clusters, I ran FindMarkers with default settings. The output is where I find myself lost. I filtered on average log2FC (>0.25 or <-0.25), min.pct of 0.25, and padj < 0.00001, and I still have 14000 genes that are significantly different between groups. How should I approach that number? Is it usual to see such a high number? Am I missing something in my analysis? I would appreciate any suggestions and advice on this. Thank you.

FindMarkers Single-cell Seurat • 305 views
22 days ago

The problem of p-value deflation (artificially small p-values) in scRNA-seq DE methods has been well documented. Naive single-cell DE methods treat each cell as a biological replicate, so you get minuscule p-values despite negligible effect sizes. But cells are not independent replicates: they are correlated within each individual (total) sample. This results in very high false positive rates for most of the single-cell-specific DE methods.
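To give a feel for how much within-sample correlation matters, here is a rough sketch using the standard design-effect formula from cluster sampling, n_eff = n / (1 + (m - 1) * icc). This is only an approximation that assumes equal numbers of cells per sample; it is not what Seurat computes, and the cell counts and intraclass correlation (icc) values are made up for illustration.

```python
def effective_sample_size(n_samples: int, cells_per_sample: int, icc: float) -> float:
    """Approximate number of independent observations among correlated cells.

    Uses the design effect 1 + (m - 1) * icc, where m is the number of
    cells per sample and icc is the within-sample (intraclass) correlation.
    """
    total_cells = n_samples * cells_per_sample
    design_effect = 1.0 + (cells_per_sample - 1) * icc
    return total_cells / design_effect


# Hypothetical setup: 4 replicates per group, 1000 cells each.
for icc in (0.0, 0.01, 0.05, 0.5):
    n_eff = effective_sample_size(n_samples=4, cells_per_sample=1000, icc=icc)
    print(f"icc={icc}: effective n per group ~ {n_eff:.1f}")
```

With an icc of 0.05, the 4000 cells in a group carry roughly the information of about 79 independent observations, yet a per-cell test computes its p-values as if there were 4000.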

As such, pseudobulking single-cell populations by cell type/cluster per individual and then comparing between conditions has been shown to perform better, and it additionally allows for handling of batch effects between individuals.
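The aggregation step itself is simple. Below is a minimal sketch in plain Python, assuming each cell comes as a (sample, cluster, raw counts) record; the function name, sample IDs, cluster IDs, and gene names are all hypothetical. The important detail is that raw counts are summed, not normalized or integrated values.

```python
from collections import defaultdict


def pseudobulk(cells):
    """Sum raw counts over all cells sharing the same (sample, cluster).

    `cells` is an iterable of (sample_id, cluster_id, {gene: raw_count}).
    Returns {(sample_id, cluster_id): {gene: summed_count}}.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for sample_id, cluster_id, counts in cells:
        for gene, count in counts.items():
            totals[(sample_id, cluster_id)][gene] += count
    return {key: dict(genes) for key, genes in totals.items()}


# Toy data with made-up names, for illustration only.
cells = [
    ("ctrl_1", "c0", {"GeneA": 3, "GeneB": 0}),
    ("ctrl_1", "c0", {"GeneA": 1, "GeneB": 2}),
    ("ctrl_1", "c1", {"GeneA": 0, "GeneB": 5}),
    ("treat_1", "c0", {"GeneA": 7, "GeneB": 1}),
]
bulk = pseudobulk(cells)
print(bulk[("ctrl_1", "c0")])
```

Each (sample, cluster) profile then acts as one observation, so for a given cluster you compare 4 pseudobulk samples against 4 with a bulk RNA-seq tool instead of thousands of cells against thousands.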

This section of the Single-Cell Best Practices book has a nice overview with lots of references and some examples. The OSCA book also has a section on this, with examples that I find easier to follow as it doesn't mix R and Python.

So to answer your question, yes, this is a common problem, and you should adjust your analysis to use more robust approaches for DE.


Thank you for the useful information, I will dig deeper into it.

I've tried pseudobulk analysis using DESeq2. One of my clusters has only 44 DEGs at padj < 0.05 (which seems fine), but the other one has only 1. The adjusted p-values increase rapidly to 0.998. I do take the batch into consideration, but that only seems to increase the number of genes by 5-10. Do you have any suggestions? Thank you again for your input.


Nothing concrete, unfortunately. You can try some other methods (limma, edgeR), but you'll likely get pretty similar results. I'd try relaxing the significance threshold to padj < 0.1.
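To see what changing the cutoff does, here is a small sketch of the Benjamini-Hochberg adjustment (the default adjustment method in DESeq2 results, as far as I know). It is written from the textbook definition rather than taken from any package, and the p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


pvals = [0.0001, 0.004, 0.03, 0.2, 0.5, 0.9]  # made-up raw p-values
padj = benjamini_hochberg(pvals)
for cutoff in (0.05, 0.1):
    n_hits = sum(p < cutoff for p in padj)
    print(f"padj < {cutoff}: {n_hits} genes")
```

Each raw p-value is scaled by the number of tests divided by its rank, so once the raw p-values stop being small the adjusted values climb quickly toward 1. A looser cutoff only helps for genes sitting just above the old one.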

I have also noticed that very small clusters tend to have poor sensitivity, likely due to higher variability among the pseudobulks. So for clusters with <20 cells per sample, you may have a tough time.
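A quick way to spot such clusters before running DE is to count cells per sample and cluster. This is a minimal sketch, assuming you can export one (sample, cluster) label pair per cell; the names are hypothetical and the cutoff of 20 is just the rule of thumb above, not an established standard.

```python
from collections import Counter

MIN_CELLS = 20  # rule-of-thumb cutoff from the post above, not a hard rule


def low_coverage_groups(cell_labels, min_cells=MIN_CELLS):
    """Return {(sample, cluster): n_cells} for groups with fewer than min_cells.

    Note: combinations with zero cells never appear in `cell_labels`,
    so they are not reported here and need a separate check.
    """
    counts = Counter(cell_labels)
    return {key: n for key, n in counts.items() if n < min_cells}


# Toy labels with made-up sample and cluster names.
labels = (
    [("ctrl_1", "c0")] * 150
    + [("ctrl_1", "c1")] * 12
    + [("treat_1", "c0")] * 90
    + [("treat_1", "c1")] * 25
)
print(low_coverage_groups(labels))
```

Here only ctrl_1 in cluster c1 is flagged (12 cells), which would be a hint that pseudobulk results for c1 may be noisy.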
