Hi guys,
During my PhD and early postdoc I have performed a differential gene expression analysis more than once. In each case, I have used the classic workflow including htseq + DESeq2.
I know how it works, I know the pitfalls, I know the possible biases, but one theoretical question still stands still for me, although I have read the DESeq2 manual more than once.
If I have a dataset that looks like this:
Samples Timepoint Treatment Replicate
Sample_1 D1 A R1
Sample_2 D1 A R2
Sample_3 D1 A R3
Sample_4 D1 B R1
Sample_5 D1 B R2
Sample_6 D1 B R3
Sample_7 D2 A R1
Sample_8 D2 A R2
Sample_9 D2 A R3
Sample_10 D2 B R1
Sample_11 D2 B R2
Sample_12 D2 B R3
And I want to compare Treatment B vs Treatment A in both timepoints:
D1, B vs A
D2, B vs A
I would set the design
variable in DESeqDataSetFromMatrix
this way: I would combine Timepoint
and Treament
into a column called Condition
which contains entries such as c("D1_A", "D1_B", "D2_A", "D2_B")
. These are indeed the four "conditions" that I want to compare.
Then, I would call it like DESeqDataSetFromMatrix( ... , design = ~ Condition)
.
However, the DESeq2 manuals also show other ways of treating your samples. For example, I could have specified something like DESeqDataSetFromMatrix( ... , design = ~ Timepoint + Treatment + Timepoint:Treatment)
. What I don't understand is what does the third part of this design syntax mean, in mathematical and biological terms.
Does it indicate a scenario where the Timepoint and the Treatment columns are not entirely independent, in the experimental setup?
(e.g. if "treatment" was "viral load" and I could expect it to grow with time regardless of how I treat the samples).
The
A:B
notation indicates an interaction term. It tests whether the fold changes of a condition changes with regard to a second condition. In your case it would e.g. test whether the fold change between Treatment A and B is different in D1 versus D2. Still, there is a major pitfall (from what I understand, without being a statistics expert) towards interpretation, and this is that only the fold change between two groups is tested. Example: Below (see plot) lets imagine A / B would be time point 1 and C / D would be time point 2. Obviously the fold change in the second set is larger than in the first, so the interaction is likely to be significant, yet the second set has a lower overall expression base level. In order to make the claim that the second time point has a greater fold change and expression of the genes is generally higher (which most people probably want to test) than in the first set one probably would also need to test e.g. A versus C and make sure that C is indeed higher or at least equal to A.The basic question that most people probably ask is whether the fold change gets larger in time point2 because C was somewhat similar to A and D would be higher than B. Probably a simple D-B could answer this rather than an interaction. It seems to me that interactions often require a second test to filter against unintuitive results for exactly what is shown in the plot below. Others may correct if I am wrong here, I personally always found interactions difficult to get my head around.