Question

DGE analysis with more than 2 groups: Compare to reference or compare to all other groups?

3.9 years ago

psm ▴ 130

Good morning all - I typically perform RNA-seq DGE analysis using DESeq2, which forces you to set one group as "reference". The results obtained show differentially expressed genes for each group relative to the reference.

However, if instead I was interested in looking at which genes uniquely mark each group, I could recombine groups (e.g. Group A vs B+C+D, B vs A+C+D, etc). This would be simple enough to do, and it seems to me like a fairly intuitive thing to do - kind of like "marker gene analysis" for clusters in single-cell RNA-seq. I just wonder why I haven't come across such a workflow performed by others (I realize it partially comes down to what the scientific question is).

Any thoughts as to whether this is a scientifically/statistically sound approach?

Cheers!

RNA-Seq • 4.0k views

updated 3.9 years ago by ATpoint 86k • written 3.9 years ago by psm ▴ 130

score 3 · Answer 1 · 2021-02-01

It is slightly more complex than you think. The crux here is your definition of e.g. B+C+D in your first example. It makes a difference whether you ask for genes in A that are differential in direct pairwise comparisons (A-B, A-C, A-D) or whether you ask for genes that are differential against B/C/D treated as a single group. More samples per group alter the dispersion estimation and the power of your experiment while putting less weight on each individual sample. If you define B+C+D as a group you might find genes which are higher in A versus this group, while the direct comparison of e.g. A vs B finds the same gene not to be differential. This could be the case if the gene was high in A, slightly (but not significantly) lower in B, and low in C/D. The average of the pooled group might be low enough to call this gene differential against B+C+D, while A-B would not be differential.
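To make the pooling effect concrete, here is a minimal sketch with made-up log2 expression values for a single gene. It only compares group means; it is not a DESeq2 analysis and performs no significance testing. The numbers and the helper `log2_diff` are hypothetical.

```python
from statistics import mean

# Hypothetical log2 expression values for one gene (made-up numbers).
expr = {
    "A": [10.0, 10.2, 9.8],
    "B": [9.6, 9.8, 9.4],
    "C": [6.0, 6.2, 5.8],
    "D": [6.1, 5.9, 6.0],
}

def log2_diff(group, others):
    """Mean log2 difference between one group and the pooled samples of the others."""
    pooled = [value for name in others for value in expr[name]]
    return mean(expr[group]) - mean(pooled)

# Direct pairwise differences: A vs B, A vs C, A vs D.
pairwise = {other: round(log2_diff("A", [other]), 2) for other in ("B", "C", "D")}

# A against B, C and D merged into a single group.
pooled = round(log2_diff("A", ["B", "C", "D"]), 2)

print(pairwise)
print(pooled)
```

With these numbers the A vs B difference is small (0.4) while the difference against the merged B+C+D group is large (2.8), because the low values in C and D pull the pooled mean down.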

The question would be what you want to answer. If you want "marker" genes, meaning genes that robustly separate a given group from all other groups, then you would in fact need to test all possible unique comparisons (A-B, A-C, A-D, B-C, B-D and so on) and then combine the DE statistics to obtain a suitable ranking of markers per condition. There are approaches to combine DE statistics into marker lists, e.g. the scran package has a combineMarkers function which mainly focuses on combining the p-values, together with an option to specify in how many of the comparisons a gene must be differential to serve as a candidate marker, see ?combineMarkers.
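The idea of combining pairwise statistics can be sketched roughly as below. This is not scran's implementation: the p-values are made up, `marker_summary` and `rank_markers` are hypothetical helpers, and taking the largest pairwise p-value is just one conservative way to summarise a gene. No multiple-testing correction is applied here.

```python
# Hypothetical pairwise p-values for group A against each other group (made-up).
pairwise_pvalues = {
    "gene1": {"A-B": 0.001, "A-C": 0.002, "A-D": 0.0005},
    "gene2": {"A-B": 0.40, "A-C": 0.0001, "A-D": 0.0002},
    "gene3": {"A-B": 0.03, "A-C": 0.20, "A-D": 0.01},
}

def marker_summary(pvals, alpha=0.05):
    """Worst-case p-value and number of significant comparisons for one gene."""
    return {
        "max_p": max(pvals.values()),
        "n_significant": sum(p < alpha for p in pvals.values()),
    }

def rank_markers(table, min_significant, alpha=0.05):
    """Keep genes significant in at least `min_significant` comparisons,
    ranked by their largest (least favourable) pairwise p-value."""
    summaries = {gene: marker_summary(p, alpha) for gene, p in table.items()}
    kept = [g for g, s in summaries.items() if s["n_significant"] >= min_significant]
    return sorted(kept, key=lambda g: summaries[g]["max_p"])

strict = rank_markers(pairwise_pvalues, min_significant=3)   # differential in every comparison
relaxed = rank_markers(pairwise_pvalues, min_significant=2)  # differential in at least two

print(strict)
print(relaxed)
```

Requiring all three comparisons keeps only gene1; relaxing the threshold to two comparisons also admits gene2 and gene3, which each fail one pairwise test.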

As said, depends on your question. Can you elaborate?

score 2 · Answer 2 · 2021-02-01

You can specify any contrast in DESeq2 using the contrast argument of results. For example, contrast=c("condition", "A", "B") and contrast=c("condition", "B", "A") are both valid. You are only limited by reference level if you use the name argument to return results.
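Why the reference level is not a real limitation can be illustrated with a small sketch: in a model with one reference level, the unshrunken log2 fold change between any two levels is a difference of the coefficients fitted against the reference. The coefficient values and the helper `pairwise_lfc` below are hypothetical; this mimics the arithmetic only and does not call DESeq2.

```python
# Hypothetical log2 fold changes for one gene, each relative to reference level "A".
coef = {"B_vs_A": 1.5, "C_vs_A": -0.5, "D_vs_A": 0.25}

def pairwise_lfc(numerator, denominator, reference="A"):
    """Log2 fold change of numerator over denominator, built from
    coefficients that are all expressed relative to the reference level."""
    def level_effect(level):
        # The reference level itself has an effect of zero by construction.
        return 0.0 if level == reference else coef[f"{level}_vs_{reference}"]
    return level_effect(numerator) - level_effect(denominator)

print(pairwise_lfc("B", "A"))  # the fitted coefficient itself
print(pairwise_lfc("A", "B"))  # same comparison with the sign flipped
print(pairwise_lfc("C", "B"))  # a comparison that does not involve the reference
```

Swapping numerator and denominator only flips the sign, and comparisons between two non-reference levels fall out of the same coefficients.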

It's also fine to (for example) combine the factor levels B, C, and D into one factor level and compare it to A. You would tend to get genes whose response relative to A is similar across the three original levels. A common example of this would be using DESeq2 to find cluster markers in scRNA-seq (either using pseudobulk or not).

In an ideal world you could include the original factor levels as random effects along with your combined factor levels, but neither DESeq2 nor edgeR supports this.