Question

Time course experiments in DESeq2

1

Entering edit mode

8.8 years ago

Chris Gene ▴ 80

Hello,

I'm having a hard time understanding the time course experimental work flow in DESeq2. I am using the F1000 paper as a template (Love MI, Anders S, Kim V and Huber W. RNA-Seq workflow: gene-level exploratory analysis and differential expression [version 1; referees: 2 approved]. F1000Research 2015, 4:1070 (doi: 10.12688/f1000research.7035.1))

Experiment:

I have 2 genotypes, 3 timepoints per genotype. 4 replicates in each genotype-condition pair (3 in one). I have a counts table generated in HTSeq - 23 samples.

coldata <- data.frame(sample=colnames(data),
                          genotype=as.factor(c(rep("het",4), rep("wt",4), rep("het",4), rep("wt",4),rep("het",3), rep("wt",4))),
                          time=as.factor(c(rep("es",8), rep("np",8),rep("nsc",7))))

sample genotype time 
1 es1 het es   
2 es2 het es   
3 es3 het es   
4 es4 het es   
5 es5 wt es   
6 es6 wt es   
7 es7 wt es   
8 es8 wt es   
9 np1 het np   
10 np2 het np   
11 np3 het np   
12 np4 het np   
13 np5 wt np   
14 np6 wt np   
15 np7 wt np   
16 np8 wt np   
17 nsc1 het nsc   
18 nsc2 het nsc   
19 nsc4 het nsc   
20 nsc5 wt nsc   
21 nsc6 wt nsc   
22 nsc7 wt nsc   
23 nsc8 wt nsc

I have tried it 2 different ways, with slightly different results. My questions are:

What is the difference between the two? I have used both elements as factors.
It's not clear how the relevel is working under these conditions
Using the same results table, can I compare the intermediate time points, i.e, 1st and 2nd then 2nd and 3rd, generating a results table for each, or would I have to separate the analysis entirely?

Thanks!

Here goes what I have done:

dds <- DESeqDataSetFromMatrix(countData=data, colData=coldata, design=~ time + genotype:time + genotype)
colData(dds)$genotype <- relevel(colData(dds)$genotype, "wt")
ddsTC <- DESeqDataSetFromMatrix(countData=data, colData=coldata, 
                                design=~ time + genotype:time + genotype)
ddsTC <- DESeq(ddsTC, test="LRT", reduced = ~ time + genotype)
resTC <- results(ddsTC)
resTC$symbol <- mcols(ddsTC)$symbol
head(resTC[order(resTC$padj),],4)

Results:

log2 fold change (MLE): timensc.genotypewt 
LRT p-value: '~ time + genotype:time + genotype' vs '~ time + genotype' 
DataFrame with 4 rows and 6 columns

summary(resTC)
out of 29372 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)     : 9, 0.031% 
LFC < 0 (down)   : 20, 0.068% 
outliers [1]     : 6, 0.02% 
low counts [2]   : 9434, 32% 
(mean count < 3)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

resultsNames(ddsTC)

[1] "Intercept"               "genotype_wt_vs_het"      "condition_np_vs_es"     
[4] "condition_nsc_vs_es"     "genotypewt.conditionnp"  "genotypewt.conditionnsc"

Second iteration:

resTC2 <- results(ddsTC, contrast=c("genotype","het","wt"))
head(resTC2[order(resTC2$padj),],4)

log2 fold change (MLE): genotype het vs wt 
LRT p-value: '~ time + genotype:time + genotype' vs '~ time + genotype' 
DataFrame with 4 rows and 6 columns

summary(resTC2)

out of 29372 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)     : 16, 0.054% 
LFC < 0 (down)   : 13, 0.044% 
outliers [1]     : 6, 0.02% 
low counts [2]   : 9434, 32% 
(mean count < 3)

sessionInfo()

R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] sva_3.18.0                 genefilter_1.52.1         
 [3] mgcv_1.8-12                nlme_3.1-126              
 [5] PoiClaClu_1.0.2            hexbin_1.27.1             
 [7] RColorBrewer_1.1-2         vsn_3.38.0                
 [9] ggplot2_2.1.0              pheatmap_1.0.8            
[11] DESeq2_1.10.1              RcppArmadillo_0.6.100.0.0 
[13] Rcpp_0.12.4                SummarizedExperiment_1.0.2
[15] Biobase_2.30.0             GenomicRanges_1.22.4      
[17] GenomeInfoDb_1.6.3         IRanges_2.4.8             
[19] S4Vectors_0.8.11           BiocGenerics_0.16.1       
[21] maptools_0.8-39            sp_1.2-2

HTSeq DESeq2 time-course RNA-seq • 9.5k views

ADD COMMENT • link updated 21 months ago by Ram 44k • written 8.8 years ago by Chris Gene ▴ 80

score 2 · Answer 1 · 2016-04-29

hi Chris,

See the note in the function help page ?results about the LRT p-values, and then also read the section in the DESeq2 vignette about the likelihood ratio test:

vignette("DESeq2")

Note that you're only changing the log fold change not the p-values with your code. You should perform a Wald test on the interaction terms for individual time points if you are interested in testing those:

results(dds, name="...", test="Wald")

Also, you should set WT to the reference level, so the comparisons are het vs WT not WT vs het. Read the DESeq2 vignette on factor levels.

score 1 · Answer 2 · 2016-04-29

1

Entering edit mode

8.8 years ago

Charles Warden 8.3k

I would say it is OK to use DESeq2/DESeq/edgeR for comparison of two timepoints (or some sort of analysis that focuses around comparing two groups, which admittedly looks like it applies in this case), and you can Google come recommendations on how to set up time-series analysis with those packages.

However, I would recommend using limma-voom if you want to identify expression changes over time (due to a continuous variable): Maybe you will have better luck with this package.

https://bioconductor.org/packages/release/bioc/html/limma.html

See pages 69 and 46 in the limma usersguide.pdf file

ADD COMMENT • link 8.8 years ago by Charles Warden 8.3k

3

Entering edit mode

DESeq2 and edgeR are both capable of analyzing complex designs, such as time series with interaction terms, leveraging the same base R functionality as limma: the formula class and the model.matrix function. Where did you pick up that DESeq2 or edgeR are only for two group comparisons? You can even include smoothing functions like splines for DESeq2, edgeR or limma equally.

ADD REPLY • link 8.8 years ago by Michael Love ★ 2.6k

0

Entering edit mode

DESeq and edgeR both use a negative binomial distribution to calculate p-values. Even in the example above, time is being defined as a factor and not a continuous, numeric variable.

If the goal was to identify genes that change as a function of time without the genotype information, this would be problematic. In this case, "time" is defined a little differently than I would expect (as "es", "np", and "esc"). If instead time was defined as a continuous variable (I don't know what is appropriate in this case, but if you timepoints of 1, 2, and 3 days and you wanted a single p-value to reflect the change in expression over time), then I would expect DEseq to give an error message. For example, I get this error message if I try to run analysis on just time (for a different set of samples):

> dds <- DESeqDataSetFromMatrix(countData=counts, colData=coldata, design=~ time)
converting counts to integer mode
the design formula contains a numeric variable with integer values,
  specifying a model with increasing fold change for higher values.
  did you mean for this to be a factor? if so, first convert
  this variable to a factor using the factor() function

If I go back to your tutorial (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4670015/#__sec31title ), the question that I am concerned about would be "Can you use DESeq2 to identify changes that occur as a function of time (in this case, minutes) in only the WT samples?"

However, I do apologize for being off-topic due to 1) the way time is defined here and 2) the fact that there are still different genotypes being compared. The tutorial is an excellent resource for people whose goal is to identify genes that change between groups at multiple time points.

ADD REPLY • link 8.7 years ago by Charles Warden 8.3k

3

Entering edit mode

Thanks for replying, and I hope I can clear some things up:

The fact that DESeq2 and edgeR use a NB and limma-voom uses a log linear model isn't really relevant here.

It's fine to use time as a continuous variable. For example, you could use a design of ~genotype + f(time) + genotype:f(time) where you use some smooth function for f, such as smoothing splines, to find genes where there is a genotype-specific change in the dependence of expression on time. Or you can use ~f(time) and compare to ~1, or whatever you like. There are many ways in base R to specific this f, ns(), bs(), poly() etc., and users need to be familiar with these approaches to produce reasonable results.

That's not an error. That's just a message.

The point of this message is, sometimes users will input a column called "treatment", with levels 1,2,3,4, etc. And this column can easily be a numeric type, using the default R functions for reading in the colData. However, that would be a big modeling mistake, treatment 4 is not (treatment 2)^2, and so on.

So this message says essentially, "are you sure that's what you want to do?". It just prints the message and let's you continue with your analysis.

In R there are 3 levels of communication with the user:

message (here is some information)
warning (pay attention! but I'll let you continue) and
error (something is wrong and I won't let you continue).

These look like:

 > message("hi")
 hi
 > warning("hi")
 Warning message:
 hi
!> stop("hi")
 Error: hi

ADD REPLY • link 8.8 years ago by Michael Love ★ 2.6k

score 0 · Answer 3 · 2016-04-30

Thank you Michael.
I just realized I hadn't releveled the data at all.

When performing a Wald test, I'm not sure whether I should include the contrast argument. As I understand from the documentation, if I provide the interaction term under name, and the contrast, the latter will take priority and the function will ignore the name.
Perhaps my confusion stems from the fact that I haven't really understood what the last two interaction terms are referring to.

Now I have the following:

    ddsTC <- DESeqDataSetFromMatrix(countData=data, colData=coldata, 
                                design=~ time + genotype:time + genotype)
ddsTC$genotype <- relevel(ddsTC$genotype, ref="wt")

ddsTC <- DESeq(ddsTC, test="LRT", reduced = ~ time + genotype)
resTC <- results(ddsTC)
resTC$symbol <- mcols(ddsTC)$symbol
head(resTC[order(resTC$padj),],4)

# log2 fold change (MLE): timensc.genotypehet
# LRT p-value: '~ time + genotype:time + genotype' vs '~ time + genotype'
# DataFrame with 4 rows and 6 columns
# baseMean log2FoldChange lfcSE stat

I have the following terms:

    resultsNames(ddsTC)
## "Intercept"           "time_np_vs_es"       "time_nsc_vs_es"     "genotype_het_vs_wt"  "timenp.genotypehet"  "timensc.genotypehet"

If I use the last term "timensc.genotypehet", is that comparing the last timepoint (nc) with the first, and genotype het vs wt?
When I releveled, it was for genotype. Should I assume that for the time factor it levels based on alphabetical order? From the second and third interaction terms that's what it seems to be doing - which happens to be fine here, given the alphabetical order reflects the time, but in future experiments I would want to know that I am applying the formula correctly.

In another post on cell type, you suggest grouping the factors:
https://support.bioconductor.org/p/67600/#67612
Would that be appropriate here?
What is the difference between grouping the factors, and using a list for the contrast argument?

Thank you
Chris