Question

DESeq2 design suggestions: one combined condition, or two separate conditions?

3

Entering edit mode

4.3 years ago

Matteo Schiavinato ★ 3.6k

Hi guys,

During my PhD and early postdoc I have performed a differential gene expression analysis more than once. In each case, I have used the classic workflow including htseq + DESeq2.

I know how it works, I know the pitfalls, I know the possible biases, but one theoretical question still stands still for me, although I have read the DESeq2 manual more than once.

If I have a dataset that looks like this:

Samples     Timepoint     Treatment     Replicate 
Sample_1    D1            A             R1
Sample_2    D1            A             R2
Sample_3    D1            A             R3
Sample_4    D1            B             R1
Sample_5    D1            B             R2
Sample_6    D1            B             R3
Sample_7    D2            A             R1
Sample_8    D2            A             R2
Sample_9    D2            A             R3
Sample_10   D2            B             R1
Sample_11   D2            B             R2
Sample_12   D2            B             R3

And I want to compare Treatment B vs Treatment A in both timepoints:

D1, B vs A
D2, B vs A

I would set the design variable in DESeqDataSetFromMatrix this way: I would combine Timepoint and Treament into a column called Condition which contains entries such as c("D1_A", "D1_B", "D2_A", "D2_B"). These are indeed the four "conditions" that I want to compare.

Then, I would call it like DESeqDataSetFromMatrix( ... , design = ~ Condition).

However, the DESeq2 manuals also show other ways of treating your samples. For example, I could have specified something like DESeqDataSetFromMatrix( ... , design = ~ Timepoint + Treatment + Timepoint:Treatment). What I don't understand is what does the third part of this design syntax mean, in mathematical and biological terms.

Does it indicate a scenario where the Timepoint and the Treatment columns are not entirely independent, in the experimental setup?

(e.g. if "treatment" was "viral load" and I could expect it to grow with time regardless of how I treat the samples).

DESeq2 mRNAseq rna-seq R Design • 2.0k views

ADD COMMENT • link updated 4.3 years ago by swbarnes2 14k • written 4.3 years ago by Matteo Schiavinato ★ 3.6k

3

Entering edit mode

The A:B notation indicates an interaction term. It tests whether the fold changes of a condition changes with regard to a second condition. In your case it would e.g. test whether the fold change between Treatment A and B is different in D1 versus D2. Still, there is a major pitfall (from what I understand, without being a statistics expert) towards interpretation, and this is that only the fold change between two groups is tested. Example: Below (see plot) lets imagine A / B would be time point 1 and C / D would be time point 2. Obviously the fold change in the second set is larger than in the first, so the interaction is likely to be significant, yet the second set has a lower overall expression base level. In order to make the claim that the second time point has a greater fold change and expression of the genes is generally higher (which most people probably want to test) than in the first set one probably would also need to test e.g. A versus C and make sure that C is indeed higher or at least equal to A.

The basic question that most people probably ask is whether the fold change gets larger in time point2 because C was somewhat similar to A and D would be higher than B. Probably a simple D-B could answer this rather than an interaction. It seems to me that interactions often require a second test to filter against unintuitive results for exactly what is shown in the plot below. Others may correct if I am wrong here, I personally always found interactions difficult to get my head around.

enter image description here

ADD REPLY • link 4.3 years ago by ATpoint 86k

score 1 · Answer 1 · 2020-10-02

Then, I would call it like DESeqDataSetFromMatrix( ... , design = ~ Condition)

Do this. Yes, you can use interactions and make a kind of complicated contrast which will do what you want, but this way is far more readable.

Interactions are used to answer questions like "Show me the genes where the treatment attenuated or heightened the changes from T1 to T2". You could manually do this by taking the fold change of A at the two timepoints, and subtracting the fold change of B from both time points, and the further your result is from 0, the more the treatment changed the time effect, but you would have no p-values. Interactions will do that subtraction and give you p-values.