Question

[DESeq2] time series analysis. Linear and squared model

2

Entering edit mode

10.5 years ago

john ▴ 30

Hello guys,

I am doing some RNA-seq analysis with DESeq2. Time course 0-48 h with one treatment of the cells. 0h is not treated and no replicates do exist.

I want to find out whether there is a linear response between time and gene expression (where time is considered to be a continuous covariate).

This is what I did so far:

>ddsfm <- DESeqDataSetFromMatrix(countData = countdata, colData = coldata, design = ~ time)
> head(assay(ddsfm))
         h0_1 h2_2 h4_3 h6_4 h8_5 h10_6 h12_7 h14_8 h24_9 h48_10
A1BG-AS1   17   12   19   47   28     4    12    36    31    114
A1CF        0    1    2    2    4     0     0     5     3     22
A2M         3    4    6   18   11    20    20    18    17     60
A2M-AS1     0    3    1    6    0    10     4     9     2     12
A2ML1       0    0    3    4    6     2     2     2     3      7
A2MP1       9   20   34  144   97     5    35    72    69    112

> coldata
       intercept time
h0_1           1   h0
h2_2           1   h2
h4_3           1   h4
h6_4           1   h6
h8_5           1   h8
h10_6          1  h10
h12_7          1  h12
h14_8          1  h14
h24_9          1  h24
h48_10         1  h48

> dds=DESeq(ddsfm,test="LRT",reduced= ~ 1, fitType='local')

> resultsNames(dds)
 [1] "Intercept"      "time_h10_vs_h0" "time_h12_vs_h0" "time_h14_vs_h0"
 [5] "time_h2_vs_h0"  "time_h24_vs_h0" "time_h4_vs_h0"  "time_h48_vs_h0"
 [9] "time_h6_vs_h0"  "time_h8_vs_h0"

As far as I understood, by calling

> res=results(dds,name="Intercept")

I get the genes that respond to time. Correct?

So I just want to know if this is fine so far. Any thoughts are greatly appreciated.

The results look like that.

> resOrdered <- res[order(res$pvalue),]
> head(resOrdered)
log2 fold change (MLE): Intercept
LRT p-value: '~ time' vs '~ 1'
DataFrame with 6 rows and 6 columns
          baseMean log2FoldChange     lfcSE      stat     pvalue      padj
         <numeric>      <numeric> <numeric> <numeric>  <numeric> <numeric>
LZTR1     46.28472       2.895392  2.466350  15.41700 0.08010024 0.9998715
FLJ37453 128.43150       5.558342  2.031662  15.32917 0.08228201 0.9998715
HMBS      47.71543       2.310480  2.494872  15.25735 0.08410516 0.9998715

Cheers
John

Well it seems like that spanish guy from here comparing each time point with deseq2 and their results had a similar question.

Also since my p-values are not really low it seems like time has no influence on the gene's expression.

rna-seq deseq2 • 5.8k views

ADD COMMENT • link updated 3.2 years ago by Ram 45k • written 10.5 years ago by john ▴ 30

1

Entering edit mode

Keep in mind that your model isn't looking for a linear affect of time on gene expression. If that's really what you want (it's what you wrote), then you need time as a numeric covariate rather than a factor.

ADD REPLY • link 10.5 years ago by Devon Ryan 105k

0

Entering edit mode

Hi Ryan,

Thanks for your response.

I was basically following the guy here DESEq2 warning while using GLM fitting

> colData(dds)
DataFrame with 10 rows and 3 columns
           intercept     time sizeFactor
            <factor> <factor>  <numeric>
h0_249_1           1       h0  0.4031904
h2_249_2           1       h2  0.5897333
h4_249_3           1       h4  0.9342609
h6_249_4           1       h6  1.9014400
h8_249_5           1       h8  0.9675656
h10_249_6          1      h10  1.0818049
h12_249_7          1      h12  0.6057505
h14_249_8          1      h14  1.2609993
h24_249_9          1      h24  0.9578546
h48_249_10         1      h48  3.7064809

Now I have got:

resultsNames(dds)
[1] "Intercept" "time"     
> colData(dds)
DataFrame with 10 rows and 3 columns
           intercept      time sizeFactor
           <numeric> <numeric>  <numeric>
h0_249_1           1         0  0.4031904
h2_249_2           1         2  0.5897333
h4_249_3           1         4  0.9342609
h6_249_4           1         6  1.9014400
h8_249_5           1         8  0.9675656
h10_249_6          1        10  1.0818049
h12_249_7          1        12  0.6057505
h14_249_8          1        14  1.2609993
h24_249_9          1        24  0.9578546
h48_249_10         1        48  3.7064809

Also my results seem different now:

> resOrdered <- res[order(res$padj),]
> head(resOrdered)
log2 fold change (MLE): time
LRT p-value: '~ time' vs '~ 1'
DataFrame with 6 rows and 6 columns
            baseMean log2FoldChange       lfcSE      stat       pvalue
           <numeric>      <numeric>   <numeric> <numeric>    <numeric>
KRT8     1612.214997    -0.05406463 0.008487662  32.71247 1.068497e-08
LOC1720     3.621876     0.13374204 0.029702967  30.17952 3.938465e-08
WDR65      55.149914    -0.05988746 0.011911142  22.82437 1.775008e-06
ANO5        2.897376     0.11646892 0.031397022  19.90285 8.147898e-06
ARRDC4      2.519225     0.11579147 0.030950616  19.37764 1.072556e-05
HLA-DPA1   69.633022     0.18477908 0.038431829  18.44600 1.747874e-05
                 padj
            <numeric>
KRT8     8.469979e-05
LOC1720  1.561010e-04
WDR65    4.690162e-03
ANO5     1.614710e-02
ARRDC4   1.700430e-02
HLA-DPA1 1.979342e-02

We might get something usefull out of that after all. I'll continue tomorrow. Thanks so far. (-:

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by john ▴ 30

0

Entering edit mode

Yeah, it's really just a matter of what question you want to ask. Your original model was more flexible, but you really need replicates for it to be useful. With the model above, you're restricted to finding genes that change in a more or less linear fashion with time (e.g., if expression doubles between 0h and 2h, then it should then double again every two hours), but you don't really need much in the way of replicates.

BTW, the fold changes produced are now "per hour".

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by Devon Ryan 105k

0

Entering edit mode

Fold changes "per hour" - nice to know.

Ryan Devon: "but you don't really need much in the way of replicates."

I suppose that means that I don't really need replicates here.

One could even assume there is an quadratic (time²) fold change. I did this like that:

> colData(dds)
DataFrame with 10 rows and 4 columns
           intercept      time timeSquared sizeFactor
           <numeric> <numeric>   <numeric>  <numeric>
h0_249_1           1         0               0          0.4031904
h2_249_2           1         2               4          0.5897333
h4_249_3           1         4              16          0.9342609
h6_249_4           1         6              36          1.9014400
h8_249_5           1         8              64          0.9675656
h10_249_6          1        10             100      1.0818049
h12_249_7          1        12             144      0.6057505
h14_249_8          1        14             196      1.2609993
h24_249_9          1        24             576      0.9578546
h48_249_10         1        48            2304      3.7064809

So this time I would like to know what genes fit either to a linear time model or a squared time model (or to no time).

I defined my model like this:

>ddsfm <- DESeqDataSetFromMatrix(countData = countdata, colData = coldata, design = ~ time+timeSquared)
>dds=DESeq(ddsfm,test="LRT",reduced= ~ 1, fitType='local')

> resultsNames(dds)
[1] "Intercept"        "time.timeSquared"    #<---- This is not what I want!

How is it done properly?

I could just do two separate runs

one with design = ~ time reduced= ~1
the other with design = ~ timeSquared reduced= ~1

But I am sure there is a better way.

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.4 years ago by john ▴ 30

0

Entering edit mode

So anyway I did two separate runs one with ~ Lineartime' vs '~ 1' and one with ~ Squaredtime' vs '~ 1'

Here one can see sufficient low pvalues but then looking at padj it tells me otherwise.

Can these first few genes considered to be differentially expressed?

>resOrdered <- res[order(res$padj),]
>head(resOrdered)

LRT p-value: '~ time' vs '~ 1'
DataFrame with 23709 rows and 5 columns
           baseMean log2FoldChange       stat       pvalue                padj
          <numeric>      <numeric>  <numeric>    <numeric>    <numeric>
KRT8    1612.214997    -0.05406463   32.71247 1.068497e-08   8.469979e-05
LOC1720    3.621876     0.13374204   30.17952 3.938465e-08   1.561010e-04
WDR65     55.149914    -0.05988746   22.82437 1.775008e-06  4.690162e-03
ANO5       2.897376     0.11646892   19.90285    8.147898e-06  1.614710e-02
ARRDC4     2.519225     0.11579147   19.37764 1.072556e-05   1.700430e-02

In the time squared model the padj values seem even worse.

LRT p-value: '~ timeSquared' vs '~ 1'
DataFrame with 23709 rows and 5 columns
              baseMean log2FoldChange      stat           pvalue                  padj
             <numeric>      <numeric> <numeric>    <numeric>      <numeric>
LOC1720       3.621876    0.002265750  21.06191 4.446813e-06    0.07646739
KRT8       1612.214997   -0.001006767  16.32833 5.326171e-05    0.45794416
ANO5          2.897376    0.002025348  15.26385   9.348877e-05    0.46131135
ARRDC4        2.519225    0.001933524  14.07206 1.759376e-04    0.46131135
CHRNA7        3.186293    0.001550793  13.95561 1.871785e-04    0.46131135

I am not sure how to handle the high padj values in that context...

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by john ▴ 30

0

Entering edit mode

Yeah, I'd take the first few genes of the first result set as being significant. For the quadratic time model, apparently there's simply not much that can reasonably fits that model.

ADD REPLY • link 10.4 years ago by Devon Ryan 105k

Ram · Answer 1 · 2015-03-11

Ok I'll do that!

What's also puzzling me is the high number of NAs in the padj column.

15.000 out of 23.000 genes have NA.

I had a look at the data and this is either caused by:

The gene has 0 counts in all time points.
The gene has 0 in some of the time points

Here is a snapshot of genes (counts) with NA in padj column.

"A4GALT" 0 0 0 0 0 3 0 0 0 0
"A4GNT" 0 0 1 0 1 0 0 0 0 10
"AA06" 0 0 0 0 0 0 0 0 0 0
"AACS" 1 0 1 0 0 1 0 2 0 0
"AACSP1" 0 0 0 0 0 0 0 0 0 0
"AADAC" 0 0 1 0 0 0 0 0 0 1
"AADACL2" 0 0 0 0 0 0 0 0 2 0

But I guess this is just "normal".

Edit: I remember reading something about NAs in DESeq but cant find it anymore.