Hi,
I performed RNA-Seq time series analysis on my set of 12 patients under therapy. Since I only have 5 time points (and for some patients only 4 time points), I decided not to use ImpulseDE2 software, but the DESeq2 spline approach (using 3 splines) the ImpulseDE2 developers used in their benchmark study in comparison to their software (https://academic.oup.com/nar/article/46/20/e119/5068248). They provided the code with their publication (SuppData3_DESeq2_for_timecourse_data.html).
My set up: 12 patients under therapy, 5 time points. Two conditions: lesional and non-lesional skin biopsies. My research question: I am seeking transcripts that show different courses over time in lesional and non-lesional biopsies. For that, I performed a case-control approach according to the ImpulseDE2 developers.
I performed batch correction. All time points from one patient present one batch.
My analysis results in too many significantly expressed genes (15,667!). Thus, I do not trust my analysis. Does anybody has suggestions or concerns what should or must be changed? The code? The set-up? The number of splines? Or do I have to live with many false positives in this kind of analysis? Should I perform a stricter BH p-value threshold than 0.05? I appreciate every hint or suggestion!
This is my code:
# Generate count matrix
ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = directory, design = ~ patient)
ddsHTSeq <- ddsHTSeq[ rowSums(counts(ddsHTSeq)) > 1, ]
matCountData <- counts(ddsHTSeq, normalized=FALSE)
# this is the annotation table
dfAnnotationDESeq2
Sample Condition Time Batch
1 1_0h_L case 0 1
2 1_4h_L case 4 1
3 1_24h_L case 24 1
4 1_2w_L case 336 1
5 1_0h_NL control 0 1
6 1_4h_NL control 4 1
7 1_24h_NL control 24 1
8 1_2w_NL control 336 1
9 2_0h_L case 0 2
10 2_4h_L case 4 2
11 2_24h_L case 24 2
12 2_2w_L case 336 2
13 2_0h_NL control 0 2
14 2_4h_NL control 4 2
15 2_24h_NL control 24 2
16 2_2w_NL control 336 2
17 3_0h_L case 0 3
18 3_4h_L case 4 3
19 3_24h_L case 24 3
20 3_2w_L case 336 3
21 3_0h_NL control 0 3
22 3_4h_NL control 4 3
23 3_24h_NL control 24 3
24 3_2w_NL control 336 3
25 4_0h_L case 0 4
26 4_4h_L case 4 4
27 4_24h_L case 24 4
28 4_2w_L case 336 4
29 4_6w_L case 1008 4
30 4_0h_NL control 0 4
31 4_4h_NL control 4 4
32 4_24h_NL control 24 4
33 4_2w_NL control 336 4
34 4_6w_NL control 1008 4
35 5_0h_L case 0 5
36 5_4h_L case 4 5
37 5_2w_L case 336 5
38 5_6w_L case 1008 5
39 5_0h_NL control 0 5
40 5_4h_NL control 4 5
41 5_2w_NL control 336 5
42 5_6w_NL control 1008 5
43 7_0h_L case 0 7
44 7_4h_L case 4 7
45 7_24h_L case 24 7
46 7_2w_L case 336 7
47 7_6w_L case 1008 7
48 7_0h_NL control 0 7
49 7_4h_NL control 4 7
50 7_24h_NL control 24 7
51 7_2w_NL control 336 7
52 7_6w_NL control 1008 7
53 8_0h_L case 0 8
54 8_4h_L case 4 8
55 8_24h_L case 24 8
56 8_2w_L case 336 8
57 8_6w_L case 1008 8
58 8_0h_NL control 0 8
59 8_4h_NL control 4 8
60 8_24h_NL control 24 8
61 8_2w_NL control 336 8
62 8_6w_NL control 1008 8
63 9_0h_L case 0 9
64 9_4h_L case 4 9
65 9_24h_L case 24 9
66 9_2w_L case 336 9
67 9_6w_L case 1008 9
68 9_0h_NL control 0 9
69 9_4h_NL control 4 9
70 9_24h_NL control 24 9
71 9_2w_NL control 336 9
72 9_6w_NL control 1008 9
73 10_0h_L case 0 10
74 10_4h_L case 4 10
75 10_24h_L case 24 10
76 10_2w_L case 336 10
77 10_0h_NL control 0 10
78 10_4h_NL control 4 10
79 10_24h_NL control 24 10
80 10_2w_NL control 336 10
81 10_6w_NL control 1008 10
82 11_0h_L case 0 11
83 11_4h_L case 4 11
84 11_24h_L case 24 11
85 11_2w_L case 336 11
86 11_6w_L case 1008 11
87 11_0h_NL control 0 11
88 11_4h_NL control 4 11
89 11_24h_NL control 24 11
90 11_2w_NL control 336 11
91 11_6w_NL control 1008 11
92 12_0h_L case 0 12
93 12_4h_L case 4 12
94 12_24h_L case 24 12
95 12_2w_L case 336 12
96 12_6w_L case 1008 12
97 12_0h_NL control 0 12
98 12_4h_NL control 4 12
99 12_2w_NL control 336 12
100 12_6w_NL control 1008 12
101 13_0h_L case 0 13
102 13_4h_L case 4 13
103 13_24h_L case 24 13
104 13_2w_L case 336 13
105 13_6w_L case 1008 13
106 13_0h_NL control 0 13
107 13_4h_NL control 4 13
108 13_24h_NL control 24 13
109 13_2w_NL control 336 13
110 13_6w_NL control 1008 13
# create a natural cubic spline basis with 3 degrees of freedom
matTimeSplineBasis <- ns(
dfAnnotationDESeq2$Time, df=3)
colnames(matTimeSplineBasis) <-
paste0("spline", seq(1, dim(matTimeSplineBasis)[2]))
# add spline basis into annotation data frame
dfAnnotationDESeq2 <- cbind(dfAnnotationDESeq2, matTimeSplineBasis)
# create DESeq2 object with spline basis vectors as individual linear predictors
dds <- DESeqDataSetFromMatrix(
countData = matObservedCounts,
colData = dfAnnotationDESeq2,
design = ~Condition + Condition:spline1 + Condition:spline2 +
Condition:spline3 + Condition:Batch)
dds <- estimateSizeFactors(dds)
dds <- estimateDispersionsGeneEst(dds)
dds <- estimateDispersionsFit(dds)
dds <- estimateDispersionsMAP(dds, outlierSD = 10)
# perform a log-likelihood ratio test
dds <- nbinomLRT(dds, full = ~Condition + Condition:spline1 + Condition:spline2 + Condition:spline3 + Condition:Batch, reduced = ~spline1 + spline2 + spline3 + Batch)
121 rows did not converge in beta, labelled in mcols(object)$fullBetaConv. Use larger maxit argument with nbinomLRT
ddsClean <- dds[which(mcols(dds)$fullBetaConv),]
dds <- ddsClean
# extract results
res <- results(dds, independentFiltering = TRUE, alpha = 0.05)
summary(res)
out of 38766 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up) : 6160, 16%
LFC < 0 (down) : 9507, 25%
outliers [1] : 0, 0%
low counts [2] : 14280, 37%
(mean count < 1)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
res
log2 fold change (MLE): Conditioncontrol.Batch9
LRT p-value: '~ Condition + Condition:spline1 + Condition:spline2 + Condition:spline3 + Condition:Batch' vs '~ spline1 + spline2 + spline3 + Batch'
DataFrame with 38766 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000000003 149.481160 -0.6514009 0.5537688 59.83414 2.693173e-07 1.029105e-06
ENSG00000000005 6.546868 -2.0796292 0.9404088 92.89478 2.851488e-13 2.275800e-12
ENSG00000000419 206.219790 -1.8657341 0.8252927 38.40941 7.847144e-04 1.685780e-03
ENSG00000000457 163.179909 -0.7362316 0.4296248 49.73745 1.329018e-05 3.859843e-05
ENSG00000000460 52.988196 -0.7491811 0.3855136 24.80852 5.257488e-02 7.859270e-02
... ... ... ... ... ... ...
ENSG00000284740 8.948296567 4.2980789 0.9467248 23.20503614 0.07988253 0.1146611
ENSG00000284744 0.162695127 0.8713780 10.3012824 0.50530215 1.00000000 NA
ENSG00000284745 0.006574792 0.7823253 10.3013237 0.02428277 1.00000000 NA
ENSG00000284747 5.979928081 1.7476591 0.7738713 18.21767274 0.25139486 0.3232673
ENSG00000284748 0.090656013 0.8454774 10.3012990 0.08209482 1.00000000 NA
sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.0
LAPACK: /usr/lib/lapack/liblapack.so.3.0
locale:
[1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8 LC_MONETARY=de_DE.UTF-8
[6] LC_MESSAGES=de_DE.UTF-8 LC_PAPER=de_DE.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 splines stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_3.1.0 BiocParallel_1.12.0 DESeq2_1.18.1 SummarizedExperiment_1.8.1 DelayedArray_0.4.1
[6] matrixStats_0.54.0 Biobase_2.38.0 GenomicRanges_1.30.3 GenomeInfoDb_1.14.0 IRanges_2.12.0
[11] S4Vectors_0.16.0 BiocGenerics_0.24.0 ImpulseDE2_1.2.0
loaded via a namespace (and not attached):
[1] bit64_0.9-7 Formula_1.2-3 assertthat_0.2.0 latticeExtra_0.6-28 blob_1.1.1 GenomeInfoDbData_1.0.0
[7] RSQLite_2.1.1 pillar_1.3.1 backports_1.1.2 lattice_0.20-35 glue_1.3.0 digest_0.6.18
[13] RColorBrewer_1.1-2 XVector_0.18.0 checkmate_1.8.5 colorspace_1.3-2 cowplot_0.9.3 htmltools_0.3.6
[19] Matrix_1.2-14 plyr_1.8.4 XML_3.98-1.16 pkgconfig_2.0.2 GetoptLong_0.1.7 genefilter_1.60.0
[25] zlibbioc_1.24.0 purrr_0.2.5 xtable_1.8-2 scales_1.0.0 htmlTable_1.12 tibble_1.4.2
[31] annotate_1.56.2 withr_2.1.2 nnet_7.3-12 lazyeval_0.2.1 survival_2.43-3 magrittr_1.5
[37] crayon_1.3.4 memoise_1.1.0 foreign_0.8-71 tools_3.4.4 data.table_1.11.4 GlobalOptions_0.1.0
[43] ComplexHeatmap_1.17.1 stringr_1.3.1 locfit_1.5-9.1 munsell_0.5.0 cluster_2.0.7-1 AnnotationDbi_1.40.0
[49] bindrcpp_0.2.2 compiler_3.4.4 rlang_0.3.0.1 grid_3.4.4 RCurl_1.95-4.11 rstudioapi_0.7
[55] rjson_0.2.20 htmlwidgets_1.2 circlize_0.4.5 bitops_1.0-6 base64enc_0.1-3 gtable_0.2.0
[61] DBI_1.0.0 R6_2.2.2 gridExtra_2.3 knitr_1.20 dplyr_0.7.8 bit_1.1-14
[67] bindr_0.1.1 Hmisc_4.1-1 shape_1.4.4 stringi_1.2.4 Rcpp_1.0.0 geneplotter_1.56.0
[73] rpart_4.1-13 acepack_1.4.1 tidyselect_0.2.5
You're working with big data and getting such a result is not odd. I think 15000 DE genes is normal for RNA-seq data analysis
I do not think that one would expect 40% of all tested genes being significant in my setting comparing to types of skin biopsies with each other. Maybe if you compare different cell types or tumor vs. healthy cells.
With all due respect, I disagree - I think 15,000 differentially expressed genes seems high (I usually expect getting closer to 1,500 genes is better for specific enrichment).
In general, I think you need to test and critically assess different methods for each project. The context is a little different, but perhaps A: How to see if adjusting batch effect in RNA-seq is working or not includes some possibly useful troubleshooting ideas?
While there could be other factors besides the differential expression playing a role, maybe this can give you a start. To be honest, it is a little hard for me to look through the details of your code (and I think I usually have to give general indirect guidance on the forum). However, if you can use your visual concordance between your replicates as a justification for the imprecision of the p-value (and you get large genes with multiple methods, such as edgeR, limma-voom, and DESeq2, testing results with and without extra variables), perhaps using a very low FDR can help narrow down a smaller list of candidate genes?
Also, I typically use a combination of p-value/FDR and fold-change thresholds (and sometimes the X% FPKM level). If the majority of the genes have less than a 50% change, I would focus on those with a greater magnitude of difference (and possibly minimum/average/maximum expression level)
Thank you Charles Warden for your comment! I will try other approaches to see whether there are just that many differences between my two conditions. Thanks for the link, I though about batch effects, but actually I did not see batches in a PCA. In this special case, I cannot use fold-change as further filtering or relevance criterion, because I performed a time series approach. Genes have a low p-value if they show differences over time and not to a particular time point. Thus, the p-value was calculated for a set of fold-changes.
If the increase or decrease is continual/linear, you can look at the correlation coefficient (such as looking at |correlation| > 0.2).
If the trend is non-linear, it is harder to give specific guidance (but that could be captured with a two-variable model for group and time point, without the splines). However, you could at least visually check that there replicates around the time range with the greatest change (to justify that difference is not due to one outlier sample).
Thank you! Yes, I considered a two-variable model already. I will try this.