Hi,
I am analyzing a single cell dataset. It is a mouse model of disease. 2 WT/control + 2 mutant/disease samples. I have already run the usual QC, clustering, cell type annotation, trajectory analysis, etc.
We have found a cell type that is only present in the disease samples. With multiple subtypes and some relationship with other "healthy" cell types (present still in the same disease samples). We want to further look into these cells and try to find the DEG and pahtways associated with these clusters.
I have previously looked into Seurat's FindMarkers and some single-cell methods for pathway enrichment, with disappointing results.
AFAIK, the consensus is that the best results for single cell data can be obtained by transforming the single cell data into pseudobulk and running "traditional" bulk RNA-seq methods on those pseudosamples.
I have been following this paper Confronting false discoveries in single-cell differential expression, where it seems that the best results are obtained by either edgeR
or DESeq2
, both using their LRT methods. However, as far as I understand, this paper benchmarked perturbation analyses across cell types. I am more interested, however, in DEG between specific cell types (pairwise) within the same samples.
Data
The data I have:
11 clusters. 3 cell types (A,B,C) with 3, 6, and 2 subtypes respectively.
- I tested this on just 2 comparisons, A1 vs A2, and B1 vs B2.
2 disease samples/biological replicates. I won't use the WT/controls because they lack our celltype of interest B.
pseudosamples generated just adding up the raw counts with
Matrix.utils::aggregate.Matrix(... , fun="sum")
Parameters:
The different parameters I tested:
Software + method:
- DESeq2
- Wald test
- LRT
- edgeR.
- exact test
- QLF
- LRT
- DESeq2
design:
- ~cluster
- ~cluster + sample
Samples in object:
- Only the 2+2 pseudosamples (2 clusters * 2 biological samples) for the clusters in the comparison.
- All 11 clusters * 2 bio samples (22 total) <- I did this only for DESeq2 because I don't know how to do it in edgeR.
Reduced model:
- when using
LRT
with a~cluster + sample
design:- reduced = ~1
- reduced = ~sample <- I did this only for DESeq2 because I don't know how to do it in edgeR.
- when using
logFC Shrinkage:
- none
- yes
- DESeq2:
lfcShrink(..., type="apeglm", ...)
- edgeR:
prior.count=5
- DESeq2:
- Pathway enrichment analysis:
fgsea
(using log2FC as "ranking" variable in all genes, not only significant DEG)- GO:BP
- GO:MF
(I have used DESeq2 a lot, but this is my first time using edgeR, so I might be doing something wrong.)
Results
And this is the number of significant up/down-DEG and enriched pathways in each method/combination of parameters:
A1 vs A2 B1 vs B2
DEG GO:bp GO:mf DEG GO:bp GO:mf
package data in object log2FC shrink. method design Down Up Down Up Down Up Down Up Down Up Down Up
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
DESeq2 all apeglm LRT reduced=~1 ~cluster 3212 3053 33 68 33 35| 175 232 24 83 7 19
DESeq2 all apeglm LRT reduced=~1 ~cluster + sample 3606 3502 31 60 30 31| 754 569 30 76 10 17
DESeq2 all apeglm LRT reduced=~sample ~cluster + sample 3517 3419 33 63 31 31| 754 569 30 79 9 18
DESeq2 all apeglm Wald ~cluster 1338 1170 33 70 32 33| 175 232 24 84 7 18
DESeq2 all apeglm Wald ~cluster + sample 2052 1657 30 61 32 29| 754 569 34 83 10 19
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
DESeq2 only comparison apeglm LRT reduced=~1 ~cluster 747 706 37 128 25 44| 68 163 14 76 8 22
DESeq2 only comparison apeglm LRT reduced=~1 ~cluster + sample 808 457 36 122 27 43| 79 104 13 65 7 21
DESeq2 only comparison apeglm LRT reduced=~sample ~cluster + sample 906 620 35 117 26 41| 57 127 12 61 7 21
DESeq2 only comparison apeglm Wald ~cluster 748 679 36 126 30 44| 68 160 13 72 7 22
DESeq2 only comparison apeglm Wald ~cluster + sample 907 619 33 125 26 41| 63 131 13 65 9 24
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
DESeq2 only comparison no LRT reduced=~1 ~cluster 747 706 8 3 11 4| 75 163 55 4 8 0
DESeq2 only comparison no LRT reduced=~1 ~cluster + sample 812 459 5 2 4 2| 87 107 62 6 10 0
DESeq2 only comparison no LRT reduced=~sample ~cluster + sample 906 620 5 2 6 2| 60 127 65 7 10 0
DESeq2 only comparison no Wald ~cluster 748 679 8 3 12 4| 75 160 59 6 8 0
DESeq2 only comparison no Wald ~cluster + sample 907 619 5 2 4 2| 68 131 65 7 10 0
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
edgeR only comparison no exact ~cluster 685 1545 5 2 3 5| 328 396 51 1 7 3
edgeR only comparison no exact ~cluster + sample 1723 2369 5 3 3 5| 561 750 38 1 5 3
edgeR only comparison no LRT ~cluster 793 1785 6 2 3 5| 118 178 58 1 7 3
edgeR only comparison no LRT ~cluster + sample 1782 2556 5 3 2 5| 110 215 43 1 5 3
edgeR only comparison no QLF ~cluster 441 1139 6 2 3 5| 196 352 52 1 6 3
edgeR only comparison no QLF ~cluster + sample 1539 2076 5 2 3 5| 471 761 51 1 7 3
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
edgeR only comparison prior.count=5 exact ~cluster 685 1545 19 25 13 16| 328 396 45 4 17 6
edgeR only comparison prior.count=5 exact ~cluster + sample 1723 2369 22 36 18 19| 561 750 37 3 16 8
edgeR only comparison prior.count=5 LRT ~cluster 793 1785 21 30 13 16| 118 178 53 6 14 6
edgeR only comparison prior.count=5 LRT ~cluster + sample 1782 2556 18 28 14 17| 110 215 39 3 16 6
edgeR only comparison prior.count=5 QLF ~cluster 441 1139 21 30 17 17| 196 352 39 3 18 6
edgeR only comparison prior.count=5 QLF ~cluster + sample 1539 2076 17 26 13 17| 471 761 45 3 13 6
I notice several trends:
The overlap of DEGs is pretty decent across all methods (data not shown). Most differences are driven by some combinations of parameters reporting a widely different number of DEG. However, the genes tend to be "in line".
- The overlap of down-regulated genes can be weaker when comparing results from DESeq2 and edgeR. The up-regulated genes show better overlaps.
- Similarly, the overlap of pathways looks good, with differences depending on the number reported.
Using all samples in the data object has a big effect in the A1 vs A2 comparison, but very little in the B1 vs B2. The pathways in A1 vs A2 are useless, with many terms too general and not related at all with the tissue/cell type studied.
When using only the 2 cell types of interest in the data object, edgeR returns more DEG than DESeq2.
Method:
- DESeq2: LRT returns more DEG than Wald when using all pseudosamples in the data object, but an almost identical number when using only those of the specific comparison.
- edgeR: when using ~cluster as the design, LRT > exact > QLF. However, when including sample as the covariate, the differences become much smaller.
Using logFC shrinkage allows
fgsea
to report more significantly enriched pathways. This is not surprising, as I am using logFC to rank genes there. Moving the log2FC of lowly-expressed genes towards 0 reduces the noise in fgsea and returns "better" pathways.- However, in the B1 vs B2 comparison, using shrinkage reduced the number of pathways in both DESeq2 and edgeR.
The choice of design/covariates has a big impact on DESeq2 when using all pseudosamples, but very small when using only the comparison of choice.
- using
~cluster + sample
has a much bigger effect in the number of DEG inedgeR than in DESeq2. However, it has almost no effect in the pathways.
- using
edgeR returns many more pathways than DESeq2 in the B1 vs B2 comparison. These 2 subtypes are much more similar than A1vsA2.
Questions:
Do you see anything striking in these results?
We are mostly interested in studying the differences of B, from the context of A. A1 and A2 are cell types already present in healthy samples (we saw that 100% in our data), while B is a cell type/profile derived from A and caused by disease. We want to test if B1 and B2 clusters are derived from "diseased" A1 and A2 cells respectively. We want to see if the differences (DEG, pahtways) between B1 and B2 are similar to the differences between A1 and A2. As well as infer what is happening in cells of B.
Seeing the different results of the different methods/parameters in A1vsA2 and B1vsB2 comparisons, which would you advise us to chose?
Why is edgeR reporting so many DEG compared to DESeq2 in the B1vsB2 comparison? Is is more sensitive to small differences? Is it more prone to false positives?
Should we use all (22) pseudosamples in the data object, or only the 2+2 of the comparison of interest?
Should we use a simple design
~cluster
, or is it advisable to use~cluster + sample
for pseudobulk analyses trying to find differences between cell types? One of our samples always has more cells (of all cell types) than the other. Should we also include the number of cells per sample as covariates?Is it worth using LRT methods over Wald/Exact for simple pairwise comparisons? Is it more correct to use
reduced=~1
orreduced=~sample
in this type of analysis?I wouldn't mind running both DESeq2 and edgeR, and use only the DEG and pathways that are common in the results of both methods. However, how should I match the choice of parameters in both packages?
Do you have any advice on how should I proceed?
A ton of thanks in advance,
Txema
Thank you for your insights!
Yes, I also lack formal stats education. I tend to use simple tests because they are easier to interpret (both for me, and for the wet lab people generating the data).
I will give a try to run the data separately, and I'll apply a >= 15% cells minimum cutoff. The idea of subsetting all samples to the same number of cells seems interesting, but some of my subtypes have very few cells compared to others (~60 vs ~3k) and I'm afraid that would end up being a problem.
And I might try to run both DESeq2 (Wald) and edgeR (exact), to keep the consensus list of significant genes and pathways from both. I suppose that will help reduce spurious results from lowly expressed genes with unrealistic log2FCs, no?
Again, thank you very much for your input.