Hi all,
I am performing a differential expression analysis with DESeq2, with these data:
- control, 2 replicates
- treatment, 2 replicates
So far there is still one obscure part of the manual for me: the design variable that you can set in many commands. I grasped the concept behind it but I am still struggling to understand how to use it properly. At the moment, I am using:
design = ~ condition
How would you set it, and why? Could someone write a couple of lines on how I should use that variable properly?
Any help appreciated!
By processed, do you mean sequenced or quality filtered?
I mean the RNA extraction and/or library preparation.
For instance, if the first two replicates were extracted together one day while the other two replicates were extracted the day after, you could expect some kind of technical variation to affect gene expression. This is called a batch effect, which is annoying. The good thing is that DESeq2 can take it into account in its model.
If all your replicates were processed in parallel, then there is no batch effect. This is an ideal situation.
If you processed the two replicates of the control condition one day, and the two replicates of the treatment condition another day, then there is a batch effect, but you cannot control for it because batch and condition are completely confounded. This is the worst situation.
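The difference between the correctable and the confounded layout can be checked before running anything in DESeq2. Below is a minimal base R sketch; the sample sheet, the day1/day2 labels, and the helper full_rank are hypothetical and only illustrate the idea. A design can only be estimated if its model matrix has full column rank, and DESeq2 stops with an error about the model matrix not being full rank when it does not.

```r
# Hypothetical sample sheet: each extraction day contains one control
# and one treatment sample (the batch effect can be modelled).
samples <- data.frame(
  condition = factor(c("control", "control", "treatment", "treatment")),
  batch     = factor(c("day1", "day2", "day1", "day2"))
)

# Cross-tabulate to see how condition and batch are distributed.
print(table(samples$condition, samples$batch))

# Worst case: all controls on day 1, all treatments on day 2.
confounded <- transform(
  samples,
  batch = factor(c("day1", "day1", "day2", "day2"))
)

# Hypothetical helper: TRUE if ~ batch + condition is estimable,
# i.e. the model matrix has as many independent columns as coefficients.
full_rank <- function(df) {
  mm <- model.matrix(~ batch + condition, data = df)
  qr(mm)$rank == ncol(mm)
}

print(full_rank(samples))     # balanced layout
print(full_rank(confounded))  # batch and condition are indistinguishable
```

In the confounded layout the batch column and the condition column of the model matrix are identical, so no model can separate the two effects.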
Thank you. Mine is the first scenario, so I will add the batch term to the design. However: is there a list, a manual or something else (not the official DESeq2 one, which I have already read) that explains clearly which terms can go in the design formula?
The limma user's guide contains a very good introduction to linear models of designed experiments; maybe have a look at the model.matrix help page as well. For your four samples, model.matrix(~ condition) will define a 4x2 matrix containing an 'intercept' column of all ones and a column containing two 0s (for the controls) and two 1s (for the treatments). DESeq2 fits a coefficient for each column in the design matrix.
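The matrix described above can be printed directly in base R. This is a small sketch that assumes the four samples are ordered as two controls followed by two treatments; the object name design_matrix is only for illustration.

```r
# Two controls followed by two treatments (assumed sample order).
condition <- factor(c("control", "control", "treatment", "treatment"))

# Build the design matrix that corresponds to design = ~ condition.
design_matrix <- model.matrix(~ condition)
print(design_matrix)

# One row per sample, one column per fitted coefficient.
print(dim(design_matrix))
print(colnames(design_matrix))
```

The first factor level is used as the reference, and R orders levels alphabetically by default, which is why "control" ends up in the intercept here.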
From your question, I feel (but I could be wrong) that you think that only specific terms are allowed in the design formula. This is not the case. The name of the factor doesn't matter at all. For instance, instead of
condition = as.factor(c("control","treatment"))
you could write
drug = as.factor(c("YES","NO"))
or
azerty = as.factor(c("hello","world"))
The design should simply include all the factors that are expected to affect gene expression in your experiment. In your case, that is the treatment and the batch, whatever names you give them.
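For the experiment in this thread, a design with both factors could look like the sketch below. The sample names, the day1/day2 batch labels, and the simulated counts are made up for illustration; real counts would come from your own quantification. The DESeq2 part only runs if the package is installed.

```r
# Hypothetical sample table: one control and one treatment per extraction day.
coldata <- data.frame(
  condition = factor(c("control", "control", "treatment", "treatment"),
                     levels = c("control", "treatment")),
  batch = factor(c("day1", "day2", "day1", "day2")),
  row.names = c("ctrl_1", "ctrl_2", "trt_1", "trt_2")
)

# Simulated counts purely for illustration (1000 genes x 4 samples).
set.seed(1)
gene_means <- 2^runif(1000, min = 4, max = 10)
counts <- matrix(
  rnbinom(4000, mu = rep(gene_means, times = 4), size = 10),
  ncol = 4,
  dimnames = list(paste0("gene", 1:1000), rownames(coldata))
)

if (requireNamespace("DESeq2", quietly = TRUE)) {
  # The factor names in the formula must match column names of coldata.
  dds <- DESeq2::DESeqDataSetFromMatrix(
    countData = counts,
    colData   = coldata,
    design    = ~ batch + condition
  )
  dds <- DESeq2::DESeq(dds)
  # Treatment versus control, adjusted for the batch term.
  res <- DESeq2::results(dds, contrast = c("condition", "treatment", "control"))
  print(head(res))
}
```

With only four samples and three coefficients there is a single residual degree of freedom, so the fit runs but has limited power.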
You were right, now I have one more piece of my puzzle. New question: when using ~ or +, what changes? I mean, apart from the arithmetic, is there any convention that I should know? I'm going to look through the limma documentation as well.
This is the usual syntax for a "formula" (see ?formula in R).
~ means that the following terms will be the factors in your design.
+ is used to add factors. Note that
~ condition + batch
fits the same model as
~ batch + condition
although DESeq2's results() reports the last variable in the design by default, so the variable of interest usually goes last.
You also have the operators * and : that are used to specify interactions between factors (not needed in your specific case). More info here and here in the context of ANOVA and linear regression, respectively.
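The behaviour of these operators can be inspected in base R without running any analysis. The sketch below assumes the same hypothetical four-sample layout as earlier in the thread (one control and one treatment per day).

```r
condition <- factor(c("control", "control", "treatment", "treatment"))
batch <- factor(c("day1", "day2", "day1", "day2"))

# Swapping the order of + terms yields the same columns in a different order.
m1 <- model.matrix(~ condition + batch)
m2 <- model.matrix(~ batch + condition)
print(setequal(colnames(m1), colnames(m2)))

# a * b is shorthand for a + b + a:b (main effects plus interaction).
labels_star <- attr(terms(~ condition * batch), "term.labels")
labels_colon <- attr(terms(~ condition + batch + condition:batch), "term.labels")
print(labels_star)
print(identical(labels_star, labels_colon))

# With four samples the interaction model needs four coefficients,
# which leaves no replicates to estimate variability.
print(ncol(model.matrix(~ condition * batch)))
```

This is one reason the interaction term is not useful here: with one sample per condition and batch combination, the interaction model has as many coefficients as samples.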
That was exactly what I was looking for. Thank you.