One would usually use one of several packages specifically designed for testing differential expression between conditions - DESeq2, edgeR and limma are the most common choices; all three are easy to use and have very good documentation. DESeq2 by default uses a Wald test on the coefficients of a negative binomial GLM, but can also use a likelihood ratio test (LRT). edgeR's recommended pipeline uses quasi-likelihood F-tests on a negative binomial GLM, while limma uses empirical Bayes moderated statistics on a normal linear model (with voom weights for count data).
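If it helps to see what the LRT idea looks like in practice, here is a minimal Python sketch of a likelihood ratio test on a negative binomial GLM for a single gene, using statsmodels. This is only an illustration of the test itself, not what DESeq2 or edgeR actually do internally (they add size-factor normalisation, dispersion estimation and shrinkage, and so on); the counts, library sizes and fixed dispersion below are invented.

```python
# Minimal sketch of a likelihood ratio test on a negative binomial GLM for one
# gene, using statsmodels. This is NOT DESeq2/edgeR's implementation -- it only
# illustrates the test. All numbers below are made up.
import numpy as np
import statsmodels.api as sm
from scipy import stats

counts = np.array([510, 480, 620, 590, 910, 870, 1020, 980])   # one gene, 8 samples
condition = np.array([0, 0, 0, 0, 1, 1, 1, 1])                 # 4 control, 4 treated
lib_size = np.array([1.0e6, 0.9e6, 1.1e6, 1.0e6, 1.0e6, 1.05e6, 0.95e6, 1.0e6])

offset = np.log(lib_size)                    # library size enters as an offset
X_full = sm.add_constant(condition)          # intercept + condition
X_null = np.ones((len(counts), 1))           # intercept only

# Fix the NB dispersion (alpha) for simplicity; real tools estimate and shrink it.
fam = sm.families.NegativeBinomial(alpha=0.05)
fit_full = sm.GLM(counts, X_full, family=fam, offset=offset).fit()
fit_null = sm.GLM(counts, X_null, family=fam, offset=offset).fit()

lr_stat = 2 * (fit_full.llf - fit_null.llf)  # likelihood ratio statistic
p_value = stats.chi2.sf(lr_stat, df=1)       # full model has 1 extra parameter
print(f"log2 fold change ~ {fit_full.params[1] / np.log(2):.2f}, LRT p = {p_value:.3g}")
```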
I would almost always use one of these purpose-built tools for analysing RNA-seq data. They deal not only with the test itself but with a whole host of important things that go along with the analysis, such as correct normalisation of the data for differential testing and robust dispersion estimation.
You might find that, if you have hundreds or thousands of replicates, a Wilcoxon rank-sum test gives you more or less similar results, but the Wilcoxon test will almost always be less powerful.
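For a sense of what the many-replicate Wilcoxon approach looks like, here is a rough sketch with scipy: a per-gene Wilcoxon rank-sum (Mann-Whitney) test with a crude CPM normalisation and Benjamini-Hochberg correction. The matrix sizes and the simulated data are made up; this is not a complete analysis pipeline.

```python
# Quick sketch of a per-gene Wilcoxon rank-sum test across many replicates,
# with a simple CPM normalisation and BH correction. Shapes and the CPM step
# are illustrative assumptions, not a full workflow.
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
n_genes, n_per_group = 2000, 200                 # hundreds of replicates per group
counts = rng.negative_binomial(5, 0.01, size=(n_genes, 2 * n_per_group))
group = np.array([0] * n_per_group + [1] * n_per_group)

cpm = counts / counts.sum(axis=0) * 1e6          # crude library-size normalisation

pvals = np.array([
    mannwhitneyu(cpm[g, group == 0], cpm[g, group == 1]).pvalue
    for g in range(n_genes)
])
reject, padj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(f"{reject.sum()} genes at 5% FDR (should be ~0 here: no true differences)")
```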
One thing to bear in mind is that the GLMs assume the counts follow a certain distribution (mostly a negative binomial). This will hold as long as the replicates are all sampled from the same underlying unimodal distribution, but strange things can happen in the presence of sub-populations: you might find genes selected as differentially expressed when they are only differential in a subset of the samples. This is, however, generally only a worry when you have 100s or 1000s of samples. In those cases, and those cases only, there is some discussion about whether it is better to use a latent variable identification algorithm (such as SVA or RUV) followed by the negative binomial based techniques, or whether it would be better to use a Wilcoxon test.
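A toy simulation of the sub-population caveat (all numbers invented): the gene below is only shifted in half of the samples from group B, yet with enough samples it will still come out as "differentially expressed".

```python
# Toy illustration of the sub-population caveat: the gene is shifted only in
# half of group B, but with many samples it is still flagged as DE.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
n = 500                                              # samples per group
group_a = rng.negative_binomial(10, 0.02, size=n)    # unimodal, mean ~490
sub1 = rng.negative_binomial(10, 0.02, size=n // 2)  # half of B looks like A
sub2 = rng.negative_binomial(10, 0.01, size=n // 2)  # other half is ~2x higher
group_b = np.concatenate([sub1, sub2])               # bimodal mixture

print(mannwhitneyu(group_a, group_b).pvalue)         # tiny p-value, even though
                                                     # only a subset of B is shifted
```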
Thank you for your detailed answer. A GLM will provide more test power and extra information, such as dispersion estimates, provided its assumptions are met!
I have a similar question at the single-cell level. I've seen people use two methods for DE analysis: one is to first pseudobulk and then use a GLM, the other is to use the Wilcoxon rank-sum test directly at the single-cell level. I assume the former is more strict, but is the latter applicable?
My understanding is that the Wilcoxon and pseudobulk approaches usually address different questions.

If you compare cells in one cluster to cells in a different cluster from the same person, the question is "Are cells of type X from individual A different from cells of type Y from individual A?". Note that this would not necessarily generalise to other individuals.

If it's the same cluster from two different individuals, then you are asking "Are cells of type X in individual A different from cells of type X in individual B?". Note that here you don't know whether the difference is something particular to that cell type, or whether the two individuals are just different, and you certainly can't attribute it to any intervention (i.e. treated vs untreated).

If you have multiple individuals, with multiple cells from each individual, then you need a multi-level model. You absolutely can't take 100 cells from A, 100 cells from B and 100 cells from C, and combine them into 300 cells used in a Wilcoxon test. That would massively overestimate your sample size (which would be 3, not 300).
Pseudobulking asks a different question: if you take 3 individuals from each of two conditions, do single-cell sequencing, cluster, select the same cluster from each individual, and then pseudobulk, you are asking "Is the average expression of cell type X in condition A different from the average expression of the same cell type in condition B?".
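To make that concrete, here is a rough pandas sketch of pseudobulking one cluster. The toy matrix and the metadata column names ("individual", "condition", "cluster") are invented for illustration; in practice you would pull these from your single-cell object and then hand the resulting genes-by-individuals matrix to DESeq2/edgeR/limma, so n is the number of individuals, not the number of cells.

```python
# Rough sketch of pseudobulking one cluster with pandas. The tiny toy matrix and
# the column names ("individual", "condition", "cluster") are invented.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cells = [f"cell{i}" for i in range(300)]
meta = pd.DataFrame({
    "individual": rng.choice(["A", "B", "C", "D", "E", "F"], size=300),
    "condition": None,                         # filled in below from the individual
    "cluster":   rng.choice(["X", "Y"], size=300),
}, index=cells)
meta["condition"] = meta["individual"].map(
    {"A": "ctrl", "B": "ctrl", "C": "ctrl", "D": "treat", "E": "treat", "F": "treat"}
)
counts = pd.DataFrame(rng.poisson(5, size=(100, 300)),
                      index=[f"gene{g}" for g in range(100)], columns=cells)

# Keep only cells of the cluster of interest, then sum raw counts per individual:
keep = meta.index[meta["cluster"] == "X"]
pseudobulk = counts[keep].T.groupby(meta.loc[keep, "individual"]).sum().T

# One row per individual -- this (not the 300 cells) is the sample table you
# would pass to DESeq2/edgeR/limma, so n = number of individuals per condition.
samples = meta.loc[keep].groupby("individual")["condition"].first()
print(pseudobulk.shape, samples.to_dict())
```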
In summary: doing the Wilcoxon test on single cells takes into account cell-to-cell variation but ignores individual-to-individual variation, whilst pseudobulk takes into account inter-individual variation but ignores cell-to-cell variation. For most questions we are interested in, the latter is more appropriate than the former.
Strongly agreed. It also occurs to me that it is normal for us to merge single cells from many people for clustering. That is also imprecise, because you can never tell whether differences between clusters come from individual-level variation or from cell-level variation (visualising how individuals/batches distribute across clusters can help with this). But for two-condition analyses, many articles apply the Wilcoxon test directly at the single-cell level (with many individuals in both conditions), and this may give inflated p-values and false-positive results.
The latter, i.e. using the Wilcoxon rank-sum test, is applicable, but the recommendation is to use pseudobulk when possible. It's important to know that the significance of the Wilcoxon rank-sum test is very dependent on the number of cells in the comparison and will likely result in over-exaggerated p-values. To further refine your results you can apply filters on the % of cells expressing the gene in one or both of the cell groups that you are comparing.
Thanks! Could you please describe this in more detail?
Do you mean filtering out a whole group if it doesn't meet a % expression threshold (e.g. excluding a group of cells from the comparison if fewer than 40% of its cells express the gene)? Why does this method help to get better results? It doesn't seem to solve the essential problem that single cells are not statistically independent.
No, I mean filtering genes, not cells, based on the percent of cells in a given group or cluster that do or don't express that gene. By removing genes expressed in only a small fraction of cells, either before or after applying the Wilcoxon rank-sum test, you will likely reduce the number of exaggerated false positives.
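Something like the following sketch, assuming a genes-by-cells count matrix; the 10% threshold and the toy data are only illustrative.

```python
# Sketch of the gene-level filter described above: keep a gene only if it is
# detected in at least some fraction of cells in at least one of the two groups.
import numpy as np

rng = np.random.default_rng(0)
counts = rng.poisson(0.3, size=(1000, 400))         # genes x cells, sparse like scRNA-seq
group = np.array([0] * 200 + [1] * 200)             # the two cell groups being compared

min_pct = 0.10
pct_g0 = (counts[:, group == 0] > 0).mean(axis=1)   # fraction of group-0 cells expressing
pct_g1 = (counts[:, group == 1] > 0).mean(axis=1)   # fraction of group-1 cells expressing
keep = (pct_g0 >= min_pct) | (pct_g1 >= min_pct)

filtered = counts[keep]                              # run the Wilcoxon only on these genes
print(f"kept {keep.sum()} of {len(keep)} genes")
```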
Thank you jv! I'll apply your method when no pseudo-bulk sample can be created!