10 weeks ago
AB
I have a dataset with 3 control and 6 disease samples. Would it be appropriate to pseudobulk for the differential expression analysis?
Even if I run FindMarkers in Seurat at the single-cell level instead of pseudobulking, how can I determine whether the differences in expression are due to differences in cell proportion rather than real expression changes? What would be the right approach in this case?
Thanks
From your stated experiment layout, I don't think it would take much time to try both pseudobulk and FindMarkers() / FindAllMarkers(). You can worry about robustness and the 'right approach', but you can also run different approaches and see what they give you. The idea in the first instance is to get to know the data, especially when you're trying to understand the biology of the cells you're looking at, as long as you're clear about what you've done and honest about the result.
I assume you have done some kind of clustering, either graph-based clustering using Leiden, or NMF, or something else. For each cluster you can calculate the average expression of a gene as well as the number of cells expressing that gene above some threshold. That will help you gauge whether the gene is a good marker of the cluster. If a gene is flagged as significantly differentially expressed but is only detected in something like 20% of the cells in the cluster, then maybe it's not a good marker.
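To make that check concrete, here is a minimal sketch in plain Python (not Seurat code). It assumes you have already pulled out one gene's expression values and the cluster label of each cell; the function name `cluster_marker_stats`, the toy values, and the threshold of 0 are all made up for illustration.

```python
from collections import defaultdict

def cluster_marker_stats(expr, clusters, threshold=0.0):
    """Per cluster: (mean expression, fraction of cells above threshold).

    expr     -- one gene's expression value per cell (hypothetical input)
    clusters -- cluster label per cell, same order as expr
    """
    grouped = defaultdict(list)
    for value, cluster in zip(expr, clusters):
        grouped[cluster].append(value)

    stats = {}
    for cluster, values in grouped.items():
        mean_expr = sum(values) / len(values)
        frac_expressing = sum(v > threshold for v in values) / len(values)
        stats[cluster] = (mean_expr, frac_expressing)
    return stats

# Toy data: in cluster A a single cell carries all of the signal,
# in cluster B most cells express the gene.
expr = [0.0, 0.0, 5.0, 0.0, 0.0, 2.0, 3.0, 1.0, 0.0, 2.0]
clusters = ["A"] * 5 + ["B"] * 5

stats = cluster_marker_stats(expr, clusters)
for cluster, (mean_expr, frac) in sorted(stats.items()):
    print(f"cluster {cluster}: mean={mean_expr:.2f}, expressing={frac:.0%}")
```

In the toy data the two clusters have similar means, but in cluster A only one cell in five expresses the gene, which is the kind of case where I would be wary of calling it a marker.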
Thank you. Is there a specific test in FindMarkers that might be better suited for this? I've always run the default Wilcoxon test. Maybe MAST or DESeq2?
Again, I would worry about this less while you're still just exploring the data. From reading some benchmarking papers, I think MAST is a bit more robust, but it also takes longer to run, so in the first instance you can run the Wilcoxon test and see what you get.
It is known that pseudoreplication and the noise of single-cell-level differential analysis hurt the robustness of the statistical inference. I would always pseudobulk if I had the chance.
Pseudobulk sums up counts for each cell type. In my dataset, I have a cell population with ~600 cells in the healthy samples and ~2000 in the diseased ones. So that's ~600 cells from 3 control samples and ~2000 cells from 6 disease samples. How can I account for this variation in cell proportion? Are the genes that show up as DE in the diseased group doing so because I have more disease samples? In that case, should I downsample before doing a DEA?
You can always subsample.
As mentioned, you can always subsample.
Pseudobulk differential gene expression methods account for differences in total counts between samples, and therefore in the number of cells summed, by estimating a size factor for each sample.
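As a rough illustration of why the cell number is absorbed, here is a minimal sketch in plain Python. Note the assumptions: it uses a simple library-size factor (each sample's total counts divided by the mean total), not the median-of-ratios or TMM estimators that tools like DESeq2 or edgeR actually use, and the function names and toy counts are invented for the example.

```python
from collections import defaultdict

def pseudobulk(cells):
    """Sum raw counts per (sample, cell_type).

    cells -- iterable of (sample, cell_type, {gene: count}) tuples
    """
    bulk = defaultdict(lambda: defaultdict(int))
    for sample, cell_type, counts in cells:
        for gene, count in counts.items():
            bulk[(sample, cell_type)][gene] += count
    return {key: dict(genes) for key, genes in bulk.items()}

def library_size_factors(bulk):
    """Simplified size factor: total counts relative to the mean total.

    Assumes every pseudobulk profile has a non-zero total.
    """
    totals = {key: sum(genes.values()) for key, genes in bulk.items()}
    mean_total = sum(totals.values()) / len(totals)
    return {key: total / mean_total for key, total in totals.items()}

def normalize(bulk, factors):
    """Divide each pseudobulk profile by its size factor."""
    return {
        key: {gene: count / factors[key] for gene, count in genes.items()}
        for key, genes in bulk.items()
    }

# Toy data: identical per-cell expression, but the disease sample
# contributes three times as many cells as the control sample.
cell = {"G1": 1, "G2": 3}
cells = [("ctrl1", "T", cell)] * 2 + [("dis1", "T", cell)] * 6

bulk = pseudobulk(cells)
factors = library_size_factors(bulk)
normalized = normalize(bulk, factors)
print(bulk)
print(factors)
print(normalized)
```

The raw sums differ threefold between the two samples, but after dividing by the size factors the profiles are identical, so the extra cells alone do not produce a difference.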
This guide is nicely written if you need some more inputs.
Still, clusters with few cells will suffer from dropouts that scaling cannot recover, so in my experience exploring the effect of subsampling is always recommended.
True, it would be misleading to scale up low counts. If the number of cells in a cluster is too low to be trusted because of dropouts, I prefer removing that entire cluster from the sample. However, I don't see how subsampling all clusters down to the size of the smallest cluster would help in the DEA. Maybe one can bootstrap and look at the variability of the output DE genes.
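A minimal sketch of that bootstrap idea in plain Python, under some loud assumptions: cells are resampled with replacement within each sample, and a log2 fold change of per-sample mean counts stands in for a real DE test (in practice you would rerun your actual pseudobulk pipeline on each replicate). The function names, the pseudocount, and the toy counts are hypothetical.

```python
import math
import random

def log2_fc(group_a, group_b, pseudocount=1.0):
    """log2 ratio (b over a) of the average per-sample mean count.

    Each group is a list of samples; each sample is a list of per-cell
    counts for one gene in one cluster.
    """
    mean_a = sum(sum(s) / len(s) for s in group_a) / len(group_a)
    mean_b = sum(sum(s) / len(s) for s in group_b) / len(group_b)
    return math.log2((mean_b + pseudocount) / (mean_a + pseudocount))

def bootstrap_log2_fc(group_a, group_b, n_boot=200, seed=0):
    """Resample cells within each sample and recompute the fold change."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    replicates = []
    for _ in range(n_boot):
        boot_a = [rng.choices(s, k=len(s)) for s in group_a]
        boot_b = [rng.choices(s, k=len(s)) for s in group_b]
        replicates.append(log2_fc(boot_a, boot_b))
    return replicates

# Toy counts for one gene: two control samples, three disease samples.
control = [[0, 1, 0, 1], [1, 0, 0, 1, 1]]
disease = [[4, 5, 6, 5], [5, 4, 6], [6, 5, 4, 4, 5]]

fcs = bootstrap_log2_fc(control, disease)
support = sum(fc > 1 for fc in fcs) / len(fcs)
print(f"log2FC range: {min(fcs):.2f} to {max(fcs):.2f}")
print(f"fraction of replicates with log2FC > 1: {support:.2f}")
```

A gene whose call flips across replicates is probably driven by a handful of cells. Resampling whole samples instead of cells would probe sample-to-sample variability, which is a different question, and with only 3 controls there is not much to resample.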