Hello,
I'm having a hard time understanding the time course experimental work flow in DESeq2. I am using the F1000 paper as a template (Love MI, Anders S, Kim V and Huber W. RNA-Seq workflow: gene-level exploratory analysis and differential expression [version 1; referees: 2 approved]. F1000Research 2015, 4:1070 (doi: 10.12688/f1000research.7035.1))
Experiment:
I have 2 genotypes, 3 timepoints per genotype. 4 replicates in each genotype-condition pair (3 in one). I have a counts table generated in HTSeq - 23 samples.
coldata <- data.frame(sample=colnames(data),
genotype=as.factor(c(rep("het",4), rep("wt",4), rep("het",4), rep("wt",4),rep("het",3), rep("wt",4))),
time=as.factor(c(rep("es",8), rep("np",8),rep("nsc",7))))
sample genotype time
1 es1 het es
2 es2 het es
3 es3 het es
4 es4 het es
5 es5 wt es
6 es6 wt es
7 es7 wt es
8 es8 wt es
9 np1 het np
10 np2 het np
11 np3 het np
12 np4 het np
13 np5 wt np
14 np6 wt np
15 np7 wt np
16 np8 wt np
17 nsc1 het nsc
18 nsc2 het nsc
19 nsc4 het nsc
20 nsc5 wt nsc
21 nsc6 wt nsc
22 nsc7 wt nsc
23 nsc8 wt nsc
I have tried it 2 different ways, with slightly different results. My questions are:
- What is the difference between the two? I have used both elements as factors.
- It's not clear how the relevel is working under these conditions
- Using the same results table, can I compare the intermediate time points, i.e, 1st and 2nd then 2nd and 3rd, generating a results table for each, or would I have to separate the analysis entirely?
Thanks!
Here goes what I have done:
dds <- DESeqDataSetFromMatrix(countData=data, colData=coldata, design=~ time + genotype:time + genotype)
colData(dds)$genotype <- relevel(colData(dds)$genotype, "wt")
ddsTC <- DESeqDataSetFromMatrix(countData=data, colData=coldata,
design=~ time + genotype:time + genotype)
ddsTC <- DESeq(ddsTC, test="LRT", reduced = ~ time + genotype)
resTC <- results(ddsTC)
resTC$symbol <- mcols(ddsTC)$symbol
head(resTC[order(resTC$padj),],4)
Results:
log2 fold change (MLE): timensc.genotypewt
LRT p-value: '~ time + genotype:time + genotype' vs '~ time + genotype'
DataFrame with 4 rows and 6 columns
summary(resTC)
out of 29372 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 9, 0.031%
LFC < 0 (down) : 20, 0.068%
outliers [1] : 6, 0.02%
low counts [2] : 9434, 32%
(mean count < 3)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
resultsNames(ddsTC)
[1] "Intercept" "genotype_wt_vs_het" "condition_np_vs_es"
[4] "condition_nsc_vs_es" "genotypewt.conditionnp" "genotypewt.conditionnsc"
Second iteration:
resTC2 <- results(ddsTC, contrast=c("genotype","het","wt"))
head(resTC2[order(resTC2$padj),],4)
log2 fold change (MLE): genotype het vs wt
LRT p-value: '~ time + genotype:time + genotype' vs '~ time + genotype'
DataFrame with 4 rows and 6 columns
summary(resTC2)
out of 29372 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 16, 0.054%
LFC < 0 (down) : 13, 0.044%
outliers [1] : 6, 0.02%
low counts [2] : 9434, 32%
(mean count < 3)
sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] sva_3.18.0 genefilter_1.52.1
[3] mgcv_1.8-12 nlme_3.1-126
[5] PoiClaClu_1.0.2 hexbin_1.27.1
[7] RColorBrewer_1.1-2 vsn_3.38.0
[9] ggplot2_2.1.0 pheatmap_1.0.8
[11] DESeq2_1.10.1 RcppArmadillo_0.6.100.0.0
[13] Rcpp_0.12.4 SummarizedExperiment_1.0.2
[15] Biobase_2.30.0 GenomicRanges_1.22.4
[17] GenomeInfoDb_1.6.3 IRanges_2.4.8
[19] S4Vectors_0.8.11 BiocGenerics_0.16.1
[21] maptools_0.8-39 sp_1.2-2
DESeq2 and edgeR are both capable of analyzing complex designs, such as time series with interaction terms, leveraging the same base R functionality as limma: the formula class and the model.matrix function. Where did you pick up that DESeq2 or edgeR are only for two group comparisons? You can even include smoothing functions like splines for DESeq2, edgeR or limma equally.
DESeq and edgeR both use a negative binomial distribution to calculate p-values. Even in the example above, time is being defined as a factor and not a continuous, numeric variable.
If the goal was to identify genes that change as a function of time without the genotype information, this would be problematic. In this case, "time" is defined a little differently than I would expect (as "es", "np", and "esc"). If instead time was defined as a continuous variable (I don't know what is appropriate in this case, but if you timepoints of 1, 2, and 3 days and you wanted a single p-value to reflect the change in expression over time), then I would expect DEseq to give an error message. For example, I get this error message if I try to run analysis on just time (for a different set of samples):
If I go back to your tutorial (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4670015/#__sec31title ), the question that I am concerned about would be "Can you use DESeq2 to identify changes that occur as a function of time (in this case, minutes) in only the WT samples?"
However, I do apologize for being off-topic due to 1) the way time is defined here and 2) the fact that there are still different genotypes being compared. The tutorial is an excellent resource for people whose goal is to identify genes that change between groups at multiple time points.
Thanks for replying, and I hope I can clear some things up:
The fact that DESeq2 and edgeR use a NB and limma-voom uses a log linear model isn't really relevant here.
It's fine to use time as a continuous variable. For example, you could use a design of ~genotype + f(time) + genotype:f(time) where you use some smooth function for f, such as smoothing splines, to find genes where there is a genotype-specific change in the dependence of expression on time. Or you can use ~f(time) and compare to ~1, or whatever you like. There are many ways in base R to specific this f, ns(), bs(), poly() etc., and users need to be familiar with these approaches to produce reasonable results.
That's not an error. That's just a message.
The point of this message is, sometimes users will input a column called "treatment", with levels 1,2,3,4, etc. And this column can easily be a numeric type, using the default R functions for reading in the colData. However, that would be a big modeling mistake, treatment 4 is not (treatment 2)^2, and so on.
So this message says essentially, "are you sure that's what you want to do?". It just prints the message and let's you continue with your analysis.
In R there are 3 levels of communication with the user:
These look like: