Has anyone had issues with seeing large log2FC changes between conditions when using pseudobulk strategies (e.g., DESeq2, edgeR, voom-limma, etc.) after subsetting clusters and comparing two different conditions? Like log2FC of up to 7 or 8?
I feel like most of the single-cell papers I've read recently use, for example, Seurat's FindMarkers() function with the Wilcoxon Rank Sum test to generate their DGE lists and volcano plots, but my understanding is that pseudobulk methods are preferable (https://www.nature.com/articles/nmeth.4612).
My log2FC changes are more in the 0.5 to 3 range if I just use FindMarkers() and the Wilcoxon Rank Sum but my gene list and the log2FCs change significantly when I subset a cluster and use a pseudobulk approach. Briefly looking at the DGE list, some of the genes that jump out make biological sense. I'm just having a tough time believing the large effect sizes! Thanks in advance!
Yes, that can happen, for example when a gene is off in one group and high in the other. logFCs cannot really be compared between pseudobulk and single-cell-level analysis. Also check where the gene sits in the MA-plot: if it is far left, the large logFC may be a result of small counts, despite the stats magic that these tools do. Single-cell data have many zeros, which deflate or inflate the fold changes depending on how many zeros there are per gene and group. Pseudobulk sums counts across cells, so the per-cell zeros largely disappear. I would make sure that you set a filter to only look at genes expressed by a certain fraction of cells in at least one group, e.g. 10 or 20%, because it makes little sense to compare genes that are expressed by e.g. 10 and 25 cells in clusters that have totals of 1000 and 2000 cells. Pseudobulk may still call changes in these genes significant. Single-cell DE is still a mess, so be sure to set intuitive filters, interpret results with care, and check whether they are biologically meaningful.
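To make the point about zeros concrete, here is a minimal sketch in plain Python (not R, and not what DESeq2, edgeR, limma or Seurat compute internally). The two summaries, the prior count of 0.5, the assumption of equal library sizes, and the made-up gene are all illustrative assumptions. It only shows how the same counts can give a modest cell-level logFC and a very large pseudobulk logFC.

```python
import math

def cell_level_log2fc(cells_a, cells_b):
    """Difference of per-cell mean log1p(count), converted to log2 units.
    A simple cell-level summary; zeros pull each group mean toward 0."""
    mean_a = sum(math.log1p(c) for c in cells_a) / len(cells_a)
    mean_b = sum(math.log1p(c) for c in cells_b) / len(cells_b)
    return (mean_b - mean_a) / math.log(2)

def pseudobulk_log2fc(cells_a, cells_b, prior=0.5):
    """log2 ratio of summed counts plus a small prior count.
    Assumes (for illustration) equal library sizes in both groups."""
    return math.log2((sum(cells_b) + prior) / (sum(cells_a) + prior))

# Hypothetical gene: almost off in condition A, on in 30% of cells in B.
cond_a = [1, 1] + [0] * 998
cond_b = [20] * 300 + [0] * 700

cell_fc = cell_level_log2fc(cond_a, cond_b)
bulk_fc = pseudobulk_log2fc(cond_a, cond_b)
print(f"cell-level log2FC: {cell_fc:.2f}")
print(f"pseudobulk log2FC: {bulk_fc:.2f}")
```

The pseudobulk value is driven by a denominator of only two reads, which is the "far left in the MA-plot" situation described above.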
Thanks - this may be the right thing. For now, I've just been filtering by the below (Filtering genes in scRNA-seq):
Any suggestions on modifying the script to filter for genes with non-zero expression in a percentage of cells instead? I guess I know the number of cells in the cluster, so I could work out how many cells make up, for example, 10% of it and use that number instead of >=1? Thanks.
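One way to express that filter, sketched in plain Python rather than as a change to the script above: keep a gene if it has non-zero counts in at least a given fraction of cells in at least one group. The function names, the `min_frac` parameter, and the toy data are hypothetical; the same logic should carry over to a count matrix in R.

```python
def detection_fraction(counts, cell_idx):
    """Fraction of the given cells in which the gene has a non-zero count."""
    return sum(1 for i in cell_idx if counts[i] > 0) / len(cell_idx)

def filter_genes(counts_by_gene, group_of_cell, min_frac=0.10):
    """Keep genes detected in >= min_frac of cells in at least one group."""
    groups = {}
    for i, g in enumerate(group_of_cell):
        groups.setdefault(g, []).append(i)
    return [
        gene
        for gene, counts in counts_by_gene.items()
        if any(detection_fraction(counts, idx) >= min_frac
               for idx in groups.values())
    ]

# Toy cluster: 20 control cells followed by 20 treated cells.
group_of_cell = ["ctrl"] * 20 + ["treat"] * 20
counts_by_gene = {
    "geneA": [0] * 20 + [3] * 4 + [0] * 16,       # 20% of treated cells
    "geneB": [1] + [0] * 19 + [2] + [0] * 19,     # 5% of cells in each group
    "geneC": [5] * 40,                            # detected everywhere
}

kept = filter_genes(counts_by_gene, group_of_cell, min_frac=0.10)
print(kept)
```

Working with a fraction per group avoids hard-coding a cell number, so the same threshold still makes sense when the groups differ in size.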
Granted, I have had minimal exposure to scRNA-seq, but I don't see how FindMarkers() with the Wilcoxon Rank Sum test can be compared to pseudobulk strategies, since the two approaches answer different questions in a kind of mutually exclusive way.
With FindMarkers()/Wilcoxon test you (typically?) compare different clusters of cells from the same experiment to find genes "diagnostic" for one cluster or the other (marker genes in fact). You cannot use, or it would be very inefficient to use, pseudobulk because for each gene you would compare just one count vs another - i.e. you would do DGE without replicates and you would ignore the variation within clusters.
Conversely, if you want to compare expression between conditions, then you have at least one sample per condition (hopefully you have more). In this case the FindMarkers()/Wilcoxon test is inappropriate because the cells are nested within samples and each cell is not an independent observation. Pseudobulk would do well instead, even though you would still ignore variation within samples.
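For illustration, the aggregation step behind pseudobulk can be sketched like this in plain Python: counts from all cells of one cluster are summed per sample, so the replicates that enter the DE test are samples, not cells. The function name, sample IDs, and counts are made up for the example.

```python
from collections import defaultdict

def pseudobulk(cells):
    """Sum raw counts per gene within each sample.
    cells: iterable of (sample_id, {gene: count}) for one cluster.
    Returns {sample_id: {gene: summed count}}."""
    totals = defaultdict(lambda: defaultdict(int))
    for sample_id, counts in cells:
        for gene, c in counts.items():
            totals[sample_id][gene] += c
    return {s: dict(g) for s, g in totals.items()}

# Hypothetical cluster with cells from two samples per condition.
cells = [
    ("ctrl_1",  {"geneA": 0, "geneB": 4}),
    ("ctrl_1",  {"geneA": 1, "geneB": 6}),
    ("ctrl_2",  {"geneA": 0, "geneB": 5}),
    ("treat_1", {"geneA": 7, "geneB": 5}),
    ("treat_1", {"geneA": 9, "geneB": 3}),
    ("treat_2", {"geneA": 8, "geneB": 6}),
]

bulk = pseudobulk(cells)
for sample_id, gene_counts in bulk.items():
    print(sample_id, gene_counts)
```

The resulting per-sample counts are what would be passed to a bulk tool, with one column per sample and the condition as the design variable.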
Am I getting this right...?
Thanks! I have multiple samples per condition for 2 conditions. I'm not using individual cells as samples. That's why I'm using the pseudobulk approach.