Too many questions, indeed:)
Q1: Should i include the interaction term in the design?
That really depends if you expect an interaction between the two terms. In DESeq2, interaction terms in the design formula are useful when you want to test whether the effect of one variable depends on the level of another variable. You should use an interaction term when you suspect that the relationship between a factor and gene expression is different across levels of another factor.
Example: You have two conditions (e.g., control vs. treatment) and two genotypes (e.g., wild-type vs. mutant).
You want to check whether the treatment effect is different in the wild-type compared to the mutant.
If you're only interested in whether condition A vs. B changes gene expression on average, an interaction term is unnecessary.
Interaction terms require more statistical power. If you have a small sample size, adding an interaction might lead to unreliable estimates or overfitting.
Since the presence of an interaction is not always known beforehand, there is a way in DESeq2 to verify whether the interaction term is necessary.
Specify the full and reduced models, then run DESeq2 using the Likelihood Ratio Test (LRT)
instead of the default Wald test.
dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata, design = ~ genotype + condition + genotype:condition)
dds <- DESeq(dds, test="LRT", reduced = ~ genotype + condition)
Now, get the results to check whether the interaction term significantly improves model fit.
res <- results(dds)
If the interaction term is significant, this means that the effect of treatment depends on genotype, and you should keep the intersction term and use the full model. If not significant, you can simplify your model and drop the interaction term
Interpret the LRT Results:
Significant p-value (e.g., p < 0.05) from the LRT: This indicates that the interaction term is statistically significant. This means the effect of treatment does depend on genotype. In this case, the full model is more appropriate. You should use the results from the full model to understand how treatment affects gene expression within each genotype. The main effects from the reduced model are less meaningful here.
Large p-value (e.g., >= 0.05) from the LRT: This suggests that the interaction term is not statistically significant. The effect of treatment is likely similar across genotypes. In this case, the reduced model is more appropriate. It's simpler and easier to interpret. You can look at the main effects of genotype and treatment directly.
Q2: The result without or with using the interaction term is same. How can
that be? Am i doing this correctly?
By default, the results()
in DESeq2 selects the last variable in the design formula and compares its last level against its first level. According to the "results" manual (?results
):
"If results is run without specifying contrast or name, it will return
the comparison of the last level of the last variable in the design
formula over the first level of that variable. For a simple two-group
comparison, this means it will return the log2 fold changes of the
second group relative to the first (reference) group."
If you run results(dds)
without specifying any arguments, it will automatically display the results for the last contrast (comparison) listed in resultsNames(dds)
, which includes all possible contrasts (comparisons). If unsure, you can always execute results(dds)
and check the first line of the output, where the actual contrast being used will be displayed, similar to this:
log2 fold change (MLE): genotypeII.conditionB
Please refer to the examples from the DESeq2 manual (?DESeq
) for proper usage and interpretation of all the contrasts. For convenience, I have included Example 2, which include an interaction term, verbatim below.
dds <- makeExampleDESeqDataSet(n=100,m=12)
dds$genotype <- factor(rep(rep(c("I","II"),each=3),2))
design(dds) <- ~ genotype + condition + genotype:condition
dds <- DESeq(dds)
resultsNames(dds)
results(dds, contrast=c("condition","B","A"))
results(dds, list( c("condition_B_vs_A","genotypeII.conditionB") ))
results(dds, name="genotypeII.conditionB")
design(dds) <- ~ genotype + condition + genotype:condition
dds <- DESeq(dds)
resultsNames(dds)
Q3: So should i run the dds command taking the whole data or should i
subset it first to my specific interest and then run the dds?
The official recomendation is to run all of them at once with whole data. See If I have multiple groups, should I run all together or split into pairs of groups?
.."Typically, we recommend users to run samples from all groups
together, and then use the contrast argument of the results function
to extract comparisons of interest after fitting the model using
DESeq."
However, if you have multiple groups with significant differences in inter-group variability, it is better to subset the data. This approach provides greater statistical power and yields more differentially expressed genes.
HTH!
According to the DESeq2 manual : "If results is run without specifying contrast or name, it will return the comparison of the last level of the last variable in the design formula over the first level of that variable. For a simple two-group comparison, this means it will return the log2 fold changes of the second group relative to the first (reference) group."
Thanks alot for the help.