Question

RNA-Seq time series analysis using a DESeq2 spline approach yields far too many significant genes

2

Entering edit mode

5.9 years ago

stu111538 ▴ 80

Hi,

I performed RNA-Seq time series analysis on my set of 12 patients under therapy. Since I only have 5 time points (and for some patients only 4 time points), I decided not to use ImpulseDE2 software, but the DESeq2 spline approach (using 3 splines) the ImpulseDE2 developers used in their benchmark study in comparison to their software (https://academic.oup.com/nar/article/46/20/e119/5068248). They provided the code with their publication (SuppData3_DESeq2_for_timecourse_data.html).

My set up: 12 patients under therapy, 5 time points. Two conditions: lesional and non-lesional skin biopsies. My research question: I am seeking transcripts that show different courses over time in lesional and non-lesional biopsies. For that, I performed a case-control approach according to the ImpulseDE2 developers.

I performed batch correction. All time points from one patient present one batch.

My analysis results in too many significantly expressed genes (15,667!). Thus, I do not trust my analysis. Does anybody has suggestions or concerns what should or must be changed? The code? The set-up? The number of splines? Or do I have to live with many false positives in this kind of analysis? Should I perform a stricter BH p-value threshold than 0.05? I appreciate every hint or suggestion!

This is my code:

# Generate count matrix
ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = directory, design = ~ patient) 
ddsHTSeq <- ddsHTSeq[ rowSums(counts(ddsHTSeq)) > 1, ]
matCountData <-  counts(ddsHTSeq, normalized=FALSE)
# this is the annotation table
dfAnnotationDESeq2
                  Sample Condition Time Batch
1      1_0h_L      case    0     1
2      1_4h_L      case    4     1
3     1_24h_L      case   24     1
4      1_2w_L      case  336     1
5     1_0h_NL   control    0     1
6     1_4h_NL   control    4     1
7    1_24h_NL   control   24     1
8     1_2w_NL   control  336     1
9      2_0h_L      case    0     2
10     2_4h_L      case    4     2
11    2_24h_L      case   24     2
12     2_2w_L      case  336     2
13    2_0h_NL   control    0     2
14    2_4h_NL   control    4     2
15   2_24h_NL   control   24     2
16    2_2w_NL   control  336     2
17     3_0h_L      case    0     3
18     3_4h_L      case    4     3
19    3_24h_L      case   24     3
20     3_2w_L      case  336     3
21    3_0h_NL   control    0     3
22    3_4h_NL   control    4     3
23   3_24h_NL   control   24     3
24    3_2w_NL   control  336     3
25     4_0h_L      case    0     4
26     4_4h_L      case    4     4
27    4_24h_L      case   24     4
28     4_2w_L      case  336     4
29     4_6w_L      case 1008     4
30    4_0h_NL   control    0     4
31    4_4h_NL   control    4     4
32   4_24h_NL   control   24     4
33    4_2w_NL   control  336     4
34    4_6w_NL   control 1008     4
35     5_0h_L      case    0     5
36     5_4h_L      case    4     5
37     5_2w_L      case  336     5
38     5_6w_L      case 1008     5
39    5_0h_NL   control    0     5
40    5_4h_NL   control    4     5
41    5_2w_NL   control  336     5
42    5_6w_NL   control 1008     5
43     7_0h_L      case    0     7
44     7_4h_L      case    4     7
45    7_24h_L      case   24     7
46     7_2w_L      case  336     7
47     7_6w_L      case 1008     7
48    7_0h_NL   control    0     7
49    7_4h_NL   control    4     7
50   7_24h_NL   control   24     7
51    7_2w_NL   control  336     7
52    7_6w_NL   control 1008     7
53     8_0h_L      case    0     8
54     8_4h_L      case    4     8
55    8_24h_L      case   24     8
56     8_2w_L      case  336     8
57     8_6w_L      case 1008     8
58    8_0h_NL   control    0     8
59    8_4h_NL   control    4     8
60   8_24h_NL   control   24     8
61    8_2w_NL   control  336     8
62    8_6w_NL   control 1008     8
63     9_0h_L      case    0     9
64     9_4h_L      case    4     9
65    9_24h_L      case   24     9
66     9_2w_L      case  336     9
67     9_6w_L      case 1008     9
68    9_0h_NL   control    0     9
69    9_4h_NL   control    4     9
70   9_24h_NL   control   24     9
71    9_2w_NL   control  336     9
72    9_6w_NL   control 1008     9
73    10_0h_L      case    0    10
74    10_4h_L      case    4    10
75   10_24h_L      case   24    10
76    10_2w_L      case  336    10
77   10_0h_NL   control    0    10
78   10_4h_NL   control    4    10
79  10_24h_NL   control   24    10
80   10_2w_NL   control  336    10
81   10_6w_NL   control 1008    10
82    11_0h_L      case    0    11
83    11_4h_L      case    4    11
84   11_24h_L      case   24    11
85    11_2w_L      case  336    11
86    11_6w_L      case 1008    11
87   11_0h_NL   control    0    11
88   11_4h_NL   control    4    11
89  11_24h_NL   control   24    11
90   11_2w_NL   control  336    11
91   11_6w_NL   control 1008    11
92    12_0h_L      case    0    12
93    12_4h_L      case    4    12
94   12_24h_L      case   24    12
95    12_2w_L      case  336    12
96    12_6w_L      case 1008    12
97   12_0h_NL   control    0    12
98   12_4h_NL   control    4    12
99   12_2w_NL   control  336    12
100  12_6w_NL   control 1008    12
101   13_0h_L      case    0    13
102   13_4h_L      case    4    13
103  13_24h_L      case   24    13
104   13_2w_L      case  336    13
105   13_6w_L      case 1008    13
106  13_0h_NL   control    0    13
107  13_4h_NL   control    4    13
108 13_24h_NL   control   24    13
109  13_2w_NL   control  336    13
110  13_6w_NL   control 1008    13

# create a natural cubic spline basis with 3 degrees of freedom
matTimeSplineBasis <- ns(
  dfAnnotationDESeq2$Time, df=3)
colnames(matTimeSplineBasis) <- 
  paste0("spline", seq(1, dim(matTimeSplineBasis)[2]))
# add spline basis into annotation data frame
dfAnnotationDESeq2 <- cbind(dfAnnotationDESeq2, matTimeSplineBasis)
# create DESeq2 object with spline basis vectors as individual linear predictors
dds <- DESeqDataSetFromMatrix(
  countData = matObservedCounts,
  colData = dfAnnotationDESeq2,
  design = ~Condition + Condition:spline1 + Condition:spline2 + 
    Condition:spline3 + Condition:Batch)

dds <- estimateSizeFactors(dds)
dds <- estimateDispersionsGeneEst(dds)
dds <- estimateDispersionsFit(dds)
dds <- estimateDispersionsMAP(dds, outlierSD = 10)
# perform a log-likelihood ratio test
dds <- nbinomLRT(dds, full = ~Condition + Condition:spline1 + Condition:spline2 + Condition:spline3 + Condition:Batch, reduced = ~spline1 + spline2 + spline3 + Batch)

121 rows did not converge in beta, labelled in mcols(object)$fullBetaConv. Use larger maxit argument with nbinomLRT

ddsClean <- dds[which(mcols(dds)$fullBetaConv),]
dds <- ddsClean

# extract results
res <- results(dds, independentFiltering = TRUE, alpha = 0.05)
summary(res)

    out of 38766 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)     : 6160, 16% 
LFC < 0 (down)   : 9507, 25% 
outliers [1]     : 0, 0% 
low counts [2]   : 14280, 37% 
(mean count < 1)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

res
log2 fold change (MLE): Conditioncontrol.Batch9 
LRT p-value: '~ Condition + Condition:spline1 + Condition:spline2 + Condition:spline3 + Condition:Batch' vs '~ spline1 + spline2 + spline3 + Batch' 
DataFrame with 38766 rows and 6 columns
                   baseMean log2FoldChange      lfcSE        stat       pvalue         padj
                  <numeric>      <numeric>  <numeric>   <numeric>    <numeric>    <numeric>
ENSG00000000003  149.481160     -0.6514009  0.5537688    59.83414 2.693173e-07 1.029105e-06
ENSG00000000005    6.546868     -2.0796292  0.9404088    92.89478 2.851488e-13 2.275800e-12
ENSG00000000419  206.219790     -1.8657341  0.8252927    38.40941 7.847144e-04 1.685780e-03
ENSG00000000457  163.179909     -0.7362316  0.4296248    49.73745 1.329018e-05 3.859843e-05
ENSG00000000460   52.988196     -0.7491811  0.3855136    24.80852 5.257488e-02 7.859270e-02
...                     ...            ...        ...         ...          ...          ...
ENSG00000284740 8.948296567      4.2980789  0.9467248 23.20503614   0.07988253    0.1146611
ENSG00000284744 0.162695127      0.8713780 10.3012824  0.50530215   1.00000000           NA
ENSG00000284745 0.006574792      0.7823253 10.3013237  0.02428277   1.00000000           NA
ENSG00000284747 5.979928081      1.7476591  0.7738713 18.21767274   0.25139486    0.3232673
ENSG00000284748 0.090656013      0.8454774 10.3012990  0.08209482   1.00000000           NA


sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.0
LAPACK: /usr/lib/lapack/liblapack.so.3.0

locale:
 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8     LC_MONETARY=de_DE.UTF-8
 [6] LC_MESSAGES=de_DE.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C

attached base packages:
 [1] parallel  stats4    splines   stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] ggplot2_3.1.0              BiocParallel_1.12.0        DESeq2_1.18.1              SummarizedExperiment_1.8.1 DelayedArray_0.4.1
 [6] matrixStats_0.54.0         Biobase_2.38.0             GenomicRanges_1.30.3       GenomeInfoDb_1.14.0        IRanges_2.12.0
[11] S4Vectors_0.16.0           BiocGenerics_0.24.0        ImpulseDE2_1.2.0

loaded via a namespace (and not attached):
 [1] bit64_0.9-7            Formula_1.2-3          assertthat_0.2.0       latticeExtra_0.6-28    blob_1.1.1             GenomeInfoDbData_1.0.0
 [7] RSQLite_2.1.1          pillar_1.3.1           backports_1.1.2        lattice_0.20-35        glue_1.3.0             digest_0.6.18
[13] RColorBrewer_1.1-2     XVector_0.18.0         checkmate_1.8.5        colorspace_1.3-2       cowplot_0.9.3          htmltools_0.3.6
[19] Matrix_1.2-14          plyr_1.8.4             XML_3.98-1.16          pkgconfig_2.0.2        GetoptLong_0.1.7       genefilter_1.60.0
[25] zlibbioc_1.24.0        purrr_0.2.5            xtable_1.8-2           scales_1.0.0           htmlTable_1.12         tibble_1.4.2
[31] annotate_1.56.2        withr_2.1.2            nnet_7.3-12            lazyeval_0.2.1         survival_2.43-3        magrittr_1.5
[37] crayon_1.3.4           memoise_1.1.0          foreign_0.8-71         tools_3.4.4            data.table_1.11.4      GlobalOptions_0.1.0
[43] ComplexHeatmap_1.17.1  stringr_1.3.1          locfit_1.5-9.1         munsell_0.5.0          cluster_2.0.7-1        AnnotationDbi_1.40.0
[49] bindrcpp_0.2.2         compiler_3.4.4         rlang_0.3.0.1          grid_3.4.4             RCurl_1.95-4.11        rstudioapi_0.7
[55] rjson_0.2.20           htmlwidgets_1.2        circlize_0.4.5         bitops_1.0-6           base64enc_0.1-3        gtable_0.2.0
[61] DBI_1.0.0              R6_2.2.2               gridExtra_2.3          knitr_1.20             dplyr_0.7.8            bit_1.1-14
[67] bindr_0.1.1            Hmisc_4.1-1            shape_1.4.4            stringi_1.2.4          Rcpp_1.0.0             geneplotter_1.56.0
[73] rpart_4.1-13           acepack_1.4.1          tidyselect_0.2.5

splines DESeq2 time-series RNA-Seq • 5.3k views

ADD COMMENT • link updated 18 months ago by Ram 44k • written 5.9 years ago by stu111538 ▴ 80

0

Entering edit mode

You're working with big data and getting such a result is not odd. I think 15000 DE genes is normal for RNA-seq data analysis

ADD REPLY • link 5.9 years ago by seta ★ 1.9k

2

Entering edit mode

I do not think that one would expect 40% of all tested genes being significant in my setting comparing to types of skin biopsies with each other. Maybe if you compare different cell types or tumor vs. healthy cells.

ADD REPLY • link 5.9 years ago by stu111538 ▴ 80

2

Entering edit mode

With all due respect, I disagree - I think 15,000 differentially expressed genes seems high (I usually expect getting closer to 1,500 genes is better for specific enrichment).

In general, I think you need to test and critically assess different methods for each project. The context is a little different, but perhaps A: How to see if adjusting batch effect in RNA-seq is working or not includes some possibly useful troubleshooting ideas?

While there could be other factors besides the differential expression playing a role, maybe this can give you a start. To be honest, it is a little hard for me to look through the details of your code (and I think I usually have to give general indirect guidance on the forum). However, if you can use your visual concordance between your replicates as a justification for the imprecision of the p-value (and you get large genes with multiple methods, such as edgeR, limma-voom, and DESeq2, testing results with and without extra variables), perhaps using a very low FDR can help narrow down a smaller list of candidate genes?

Also, I typically use a combination of p-value/FDR and fold-change thresholds (and sometimes the X% FPKM level). If the majority of the genes have less than a 50% change, I would focus on those with a greater magnitude of difference (and possibly minimum/average/maximum expression level)

ADD REPLY • link 5.9 years ago by Charles Warden 8.3k

0

Entering edit mode

Thank you Charles Warden for your comment! I will try other approaches to see whether there are just that many differences between my two conditions. Thanks for the link, I though about batch effects, but actually I did not see batches in a PCA. In this special case, I cannot use fold-change as further filtering or relevance criterion, because I performed a time series approach. Genes have a low p-value if they show differences over time and not to a particular time point. Thus, the p-value was calculated for a set of fold-changes.

ADD REPLY • link 5.9 years ago by stu111538 ▴ 80

1

Entering edit mode

If the increase or decrease is continual/linear, you can look at the correlation coefficient (such as looking at |correlation| > 0.2).

If the trend is non-linear, it is harder to give specific guidance (but that could be captured with a two-variable model for group and time point, without the splines). However, you could at least visually check that there replicates around the time range with the greatest change (to justify that difference is not due to one outlier sample).