Let's imagine a single cell experiment in which we have 3 biological replicates, treated (TR) and untreated (UNT).
After all the necessary filtering and integration steps, we isolate a cluster of interest (cluster X), for which we want to test differential gene expression (DE) between TR and UNT. Ideally one would filter lowly expressed genes, aggregate counts within each replicate x protocol to generate pseudobulks, and use tried and tested NB methods such as the ones implemented in DESeq2 or edgeR to get DE between conditions.
At the same time, we have also tested for differential abundance (DA) between TR and UNT (using a statistical framework such as Milo), and we know that TR is more abundant than UNT in X - say 60% TR and 40% UNT, for the sake of argument.
My question is: how do you account for DA when doing DE? Is it even necessary?
I had thought of 2 ways but I am not convinced by any of them:
- downsampling (by random sampling without replacement) each replicate so the number of cells matches the smallest number, in this case so that TR and UNT have the same number of cells. One could also use more refined sampling strategies such as geometrical sketching, k-means + KNN etc;
- adding DA as a numeric covariate in the design.
I don't like 1. because I would be literally throwing counts away (especially if DA is strong) and I would introduce an otherwise unnecessary random element, and I don't like 2. either, because it looks like an overly strong correction.
An argument against accounting for DA is that the inclusion of depth size factors in the model would take care of any systematic, global difference given by compositional effects. However, I worry that genes that are not consistently quantified across cells would be artificially inflated by this unequal coverage, i.e. by virtue of having many 0's they would have a higher chance of being DE after aggregation, just because they were supported by more cells. I don't know if the count depth normalization would be enough to overcome this issue.
Moreover, even if we didn't care about single genes and focused more global results such as GSEA, there is no guarantee that the same pathways will be significantly enriched before and after downsampling, or that the enrichment score will follow the same direction, because rankings may change if we downsample cells and "lose" counts for several genes. Honestly, I don't even know how to go about benchmarking this particular aspect - I believe it is difficult to simulate pathway enrichments with currently available tools. I tested GSEA on DE results before and after downsampling on a dataset and it looks like some NES values for significant (in at least one of the two enrichment tests) pathways have an opposite sign.
I am familiar with the OSCA book chapter on handling DE and DA, but I don't feel like it addresses this particular question as it just suggests to run both analyses (e.g. here).
Do you have an informed opinion, or any suggestion?
Many thanks for the reply, Gordon.
I will try to clarify. My hypothesis is that the treatment strongly affects gene expression in cluster X, and I appreciate that, provided there are enough replicates, pseudobulking and DE is the ideal approach. However, and here may lie my mistake in using the term DA, I also see that cluster X is composed of more cells belonging to the treatment group compared to the control. This is what I refer to when I say that there is a non-negligible DA effect.
My question was then whether I could still interpret the results of DE in the same way I would if I had, for instance, isolated cluster X via FACS and performed my experiment followed by bulk RNA-seq. I appreciate how DE and DA are tightly linked, so it may very well be that the question of whether my DE comes from differential composition or differential gene expression is irrelevant in this context.
I understood exactly what you meant. There wasn't any ambiguity.
Yes, in principle, pseudo-bulk is equivalent to FACS sorting the cells and doing bulk RNA-seq. That is the motivation for calling it "pseudo-bulk". That of course assumes that scRNA-seq can identify the cell-types perfectly, which is never going to be completely true. Bulk RNA-seq does not care that the cell type might become more or less common after the treatment. Bulk RNA-seq is simply assessing DE in the cells that do survive and can be isolated. Pseudo-bulk is analogous.
Assuming that you can sort the cell-types effectively, then DE and DA are complementary rather than linked. There is no a priori reason why a DA cell type also needs to show DE, although in practice it often does.
In your comment, you introduce a new term "differential composition" and I am unclear what you mean by that. Are you simply using "differential composition" as a synonym for DA? DE does not come from DA in the pseudo-bulk context, so adjusting for it is not a meaningful thing to do IMO. If you have do bulk RNA-seq on unsorted cells, then differential cell-type composition is a critical factor and will drive DE. If the cell types are sorted however that ceases to be a consideration. Pseudo-bulk assumes you have completely sorted cells.
Thanks Gordon for your additional comments, I did use "differential composition" as a synonym for DA. Your comments answer my question.