I analyzed an RNA-Seq experiment (from human samples) a little over a year ago for a colleague, using DESeq2. The analysis produced some good results (96 genes upregulated and 263 downregulated, using default settings) which made good biological sense.
The investigator wanted to add a couple more samples since the sample size was fairly small, so recently she sequenced two more pairs of samples and asked me to run the same analysis with the full set. As you can guess, all the nice results vanished.
To check the analysis, I omitted the new samples and re-ran the same analysis I had run previously, on the same samples, and got completely different results: 9 genes upregulated and 12 downregulated.
To make sure I was doing exactly the same thing both times, I loaded up the saved RData file from last year's analysis. This has the original summarized experiment (se), the DESeqDataSet (dds) after running DESeq(), and the results. I ran exactly the same commands I had run previously, creating a new DESeqDataSet from the same summarized experiment as follows:
dds_new <- DESeqDataSet(se, design = design(dds))
# I know this is a left-over habit from the original DESeq days, which I shouldn't still be using, however, I did it a year ago:
dds_new <- dds_new[rowSums(counts(dds_new)) > 1,]
dds_new <- DESeq(dds_new)
and sure enough I have different results:
> summary(results(dds))
out of 35036 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 96, 0.27%
LFC < 0 (down) : 263, 0.75%
outliers [1] : 0, 0%
low counts [2] : 0, 0%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
> summary(results(dds_new))
out of 35036 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 9, 0.026%
LFC < 0 (down) : 12, 0.034%
outliers [1] : 0, 0%
low counts [2] : 0, 0%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
I did some further checks, and the difference is in the dispersion estimates:
> summary(counts(dds)-counts(dds_new))
V1 V2 V3 V4 V5 V6
Min. :0 Min. :0 Min. :0 Min. :0 Min. :0 Min. :0
1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0
Median :0 Median :0 Median :0 Median :0 Median :0 Median :0
Mean :0 Mean :0 Mean :0 Mean :0 Mean :0 Mean :0
3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
Max. :0 Max. :0 Max. :0 Max. :0 Max. :0 Max. :0
V7 V8 V9 V10 V11 V12
Min. :0 Min. :0 Min. :0 Min. :0 Min. :0 Min. :0
1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0
Median :0 Median :0 Median :0 Median :0 Median :0 Median :0
Mean :0 Mean :0 Mean :0 Mean :0 Mean :0 Mean :0
3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
Max. :0 Max. :0 Max. :0 Max. :0 Max. :0 Max. :0
V13 V14 V15 V16 V17
Min. :0 Min. :0 Min. :0 Min. :0 Min. :0
1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0
Median :0 Median :0 Median :0 Median :0 Median :0
Mean :0 Mean :0 Mean :0 Mean :0 Mean :0
3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
Max. :0 Max. :0 Max. :0 Max. :0 Max. :0
> summary(counts(dds,normalized=T)-counts(dds_new, normalized=T))
V1 V2 V3 V4 V5 V6
Min. :0 Min. :0 Min. :0 Min. :0 Min. :0 Min. :0
1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0
Median :0 Median :0 Median :0 Median :0 Median :0 Median :0
Mean :0 Mean :0 Mean :0 Mean :0 Mean :0 Mean :0
3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
Max. :0 Max. :0 Max. :0 Max. :0 Max. :0 Max. :0
V7 V8 V9 V10 V11 V12
Min. :0 Min. :0 Min. :0 Min. :0 Min. :0 Min. :0
1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0
Median :0 Median :0 Median :0 Median :0 Median :0 Median :0
Mean :0 Mean :0 Mean :0 Mean :0 Mean :0 Mean :0
3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
Max. :0 Max. :0 Max. :0 Max. :0 Max. :0 Max. :0
V13 V14 V15 V16 V17
Min. :0 Min. :0 Min. :0 Min. :0 Min. :0
1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0
Median :0 Median :0 Median :0 Median :0 Median :0
Mean :0 Mean :0 Mean :0 Mean :0 Mean :0
3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
Max. :0 Max. :0 Max. :0 Max. :0 Max. :0
> sizeFactors(dds)-sizeFactors(dds_new)
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> summary(dispersions(dds)-dispersions(dds_new))
Min. 1st Qu. Median Mean 3rd Qu. Max.
-16.67847 -0.49073 -0.03083 -1.06727 0.00000 8.96455
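For anyone wanting to dig one level deeper, it can help to separate the gene-wise dispersion estimates (which depend only on the counts and size factors) from the fitted/shrunken values (which depend on the fitting code that changed between versions). A minimal sketch, assuming both objects (dds and dds_new, as above) are still in the session:

```r
library(DESeq2)

# Gene-wise estimates, before trend fitting and shrinkage
disp_gene_old <- mcols(dds)$dispGeneEst
disp_gene_new <- mcols(dds_new)$dispGeneEst

# If these agree but dispersions() differ, the change is in the
# trend fitting / shrinkage step rather than in the raw estimates
summary(disp_gene_old - disp_gene_new)

# Visual comparison of the two dispersion fits
par(mfrow = c(1, 2))
plotDispEsts(dds,     main = "old fit (DESeq2 1.10.1)")
plotDispEsts(dds_new, main = "new fit (DESeq2 1.14.1)")
```

This is only a diagnostic sketch; the column names come from the standard mcols() output of a DESeqDataSet after running DESeq().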
From my log files (which I am pretty certain were generated when I ran this a year ago), I have the following sessionInfo() output:
R version 3.2.1 (2015-06-18)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS release 6.6 (Final)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 methods stats graphics grDevices utils
[8] datasets base
other attached packages:
[1] org.Hs.eg.db_3.2.3 RSQLite_1.0.0
[3] DBI_0.4 RColorBrewer_1.1-2
[5] pheatmap_1.0.8 DESeq2_1.10.1
[7] RcppArmadillo_0.5.500.2.0 Rcpp_0.12.4
[9] GenomicAlignments_1.6.3 SummarizedExperiment_1.0.2
[11] GenomicFeatures_1.22.13 AnnotationDbi_1.32.3
[13] Biobase_2.30.0 Rsamtools_1.22.0
[15] Biostrings_2.38.4 XVector_0.10.0
[17] GenomicRanges_1.22.4 GenomeInfoDb_1.6.3
[19] IRanges_2.4.8 S4Vectors_0.8.11
[21] BiocGenerics_0.16.1
loaded via a namespace (and not attached):
[1] futile.logger_1.4.1 plyr_1.8.3 bitops_1.0-6
[4] futile.options_1.0.0 tools_3.2.1 zlibbioc_1.16.0
[7] digest_0.6.9 rpart_4.1-10 biomaRt_2.26.1
[10] annotate_1.48.0 gtable_0.2.0 lattice_0.20-33
[13] Matrix_1.2-6 gridExtra_2.2.1 genefilter_1.52.1
[16] cluster_2.0.4 rtracklayer_1.30.4 locfit_1.5-9.1
[19] nnet_7.3-12 grid_3.2.1 data.table_1.9.6
[22] XML_3.98-1.4 survival_2.39-2 BiocParallel_1.4.3
[25] foreign_0.8-66 latticeExtra_0.6-28 Formula_1.2-1
[28] geneplotter_1.48.0 ggplot2_2.1.0 lambda.r_1.1.7
[31] Hmisc_3.17-4 scales_0.4.0 splines_3.2.1
[34] xtable_1.8-2 colorspace_1.2-6 labeling_0.3
[37] acepack_1.3-3.3 RCurl_1.95-4.8 munsell_0.4.3
[40] chron_2.3-47
and from my current session, I have
> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Scientific Linux 7.3 (Nitrogen)
Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] DESeq2_1.14.1 SummarizedExperiment_1.4.0
[3] Biobase_2.34.0 GenomicRanges_1.26.4
[5] GenomeInfoDb_1.10.3 IRanges_2.8.2
[7] S4Vectors_0.12.2 BiocGenerics_0.20.0
loaded via a namespace (and not attached):
[1] genefilter_1.56.0 locfit_1.5-9.1 splines_3.4.1
[4] lattice_0.20-35 colorspace_1.3-2 htmltools_0.3.6
[7] rtracklayer_1.34.2 GenomicFeatures_1.26.4 base64enc_0.1-3
[10] blob_1.1.0 survival_2.41-3 XML_3.98-1.9
[13] rlang_0.1.1 foreign_0.8-69 DBI_0.7
[16] BiocParallel_1.10.1 bit64_0.9-7 RColorBrewer_1.1-2
[19] plyr_1.8.4 stringr_1.2.0 zlibbioc_1.20.0
[22] Biostrings_2.42.1 munsell_0.4.3 gtable_0.2.0
[25] htmlwidgets_0.9 memoise_1.1.0 latticeExtra_0.6-28
[28] knitr_1.16 biomaRt_2.30.0 geneplotter_1.52.0
[31] AnnotationDbi_1.36.2 htmlTable_1.9 Rcpp_0.12.12
[34] acepack_1.4.1 xtable_1.8-2 scales_0.4.1
[37] backports_1.1.0 checkmate_1.8.3 Hmisc_4.0-3
[40] annotate_1.52.1 XVector_0.14.1 bit_1.1-12
[43] Rsamtools_1.26.2 gridExtra_2.2.1 ggplot2_2.2.1
[46] digest_0.6.12 stringi_1.1.5 grid_3.4.1
[49] bitops_1.0-6 tools_3.4.1 magrittr_1.5
[52] lazyeval_0.2.0 RCurl_1.95-4.8 tibble_1.3.3
[55] RSQLite_2.0 Formula_1.2-2 cluster_2.0.6
[58] Matrix_1.2-10 data.table_1.10.4 rpart_4.1-11
[61] GenomicAlignments_1.10.1 nnet_7.3-12 compiler_3.4.1
So my questions are:
- Has anything changed between DESeq2 versions 1.10.1 and 1.14.1 that would cause this (pretty dramatic) change in dispersion estimates?
- Is there anything else I can check in the data I have saved to see what might differ between the two analyses?
You deserve a lot of credit for keeping your previous results well preserved.
@glocke01 Thanks: I am quite picky about that kind of thing. Unfortunately, it doesn't seem to have helped a whole lot in this case...
Tagging: Michael Love
He has not been by for a couple of weeks but he eventually will.
For faster resolution, post this on the Bioconductor support site for DESeq2, as was suggested earlier.
You are more likely to get a response asking at the Bioconductor forum, as the DESeq2 authors are very active there. If/when you cross-post, apologize for cross-posting, mention that you did so at some random dude's suggestion, and link to this thread.
But anyway, it is possible something changed inside DESeq2 that would cause this kind of change; I had a similar experience with edgeR versions.
Yes, I believe that they recently modified how design formulae are handled, but I cannot be sure. What is your design formula, and what type of data does it include?
@KevinBlighe I can't give all the details as it's essentially someone else's project, but there are "before treatment" and "after treatment" samples from two different treatment groups. The design formula is essentially
~ timepoint + treatment + sampleID.treatment + timepoint:treatment
(i.e. I'm interested in the interaction term: I want to know how the treatment affects the before -> after change). The sampleID.treatment term is essentially the ind.n term from the vignette section "Group-specific condition effects, individuals nested within groups"; timepoint and treatment are both two-level factors.
Hey James, I believe that it's the interaction terms that they specifically changed in version 1.14.1, but I honestly cannot be certain. I recall requiring an interaction term in a recent study and noticed the different behaviour of DESeq2 in the design model. I then decided to just manually install version 1.10.1 because I was more familiar with that version.
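For readers trying to reproduce this kind of design, the vignette's nested-individuals trick boils down to re-indexing subjects within each treatment group. A minimal sketch with made-up metadata (the column names here are illustrative, not the poster's actual data):

```r
library(DESeq2)

# Toy metadata: 2 treatment groups x 2 subjects each x before/after
coldata <- data.frame(
  treatment = factor(rep(c("A", "B"), each = 4)),
  timepoint = factor(rep(c("before", "after"), times = 4),
                     levels = c("before", "after")),
  subject   = factor(rep(1:4, each = 2))
)

# Re-number subjects within each treatment group, as in the vignette's
# "Group-specific condition effects, individuals nested within groups"
coldata$sampleID.treatment <- factor(
  ave(as.integer(coldata$subject), coldata$treatment,
      FUN = function(x) match(x, unique(x)))
)

# The design discussed above; timepoint:treatment is the term of interest
model.matrix(~ timepoint + treatment + sampleID.treatment +
               timepoint:treatment, coldata)
```

Inspecting the model matrix like this is a quick way to confirm the design is full rank before handing it to DESeqDataSet().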
You may want to take a look at these threads on the Bioconductor forum: https://support.bioconductor.org/p/83645/ https://support.bioconductor.org/p/73676/#73694
You may want to look at the list of updates to each version in the NEWS file too: https://github.com/Bioconductor-mirror/DESeq2/blob/master/NEWS
That's really helpful, thanks. I think my next step is to try to recreate an environment with 1.10.1 and make sure I can recreate the previous results with that. (If so, we then have to decide what it means that we get one set of results with one version and a different set with a different version. But at least I have an idea what may be happening now, and a way forward.)
I think the relevant change was the stabilization of the estimates of mu_ij that's listed in NEWS between 1.10 and 1.14. This can then have an effect downstream, modifying the dispersion estimates. I always recommend that users not switch versions once they've started an analysis and are wedded in some way to the output, as the software implements a complex and interdependent set of methods.
I know it seems like the numbers change a lot, but it's often a relatively small change to the test statistic that seems big after thresholding at a significance level, which we do when we build FDR sets. The Wald statistics are probably highly correlated here.
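The point about correlated test statistics can be checked directly. A minimal sketch, assuming both objects (dds and dds_new, as above) are in the session and cover the same genes:

```r
library(DESeq2)

res_old <- results(dds)      # results from the old fit
res_new <- results(dds_new)  # results from the new fit

# High correlation here would mean the disagreement is mostly genes
# sitting near the adjusted p-value threshold, not wholesale changes
cor(res_old$stat, res_new$stat, use = "complete.obs")

plot(res_old$stat, res_new$stat,
     xlab = "Wald statistic (old fit)",
     ylab = "Wald statistic (new fit)")
abline(0, 1, col = "red")
```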
I wrote up an explanation of this one time on the support site:
https://support.bioconductor.org/p/73676/#73694
Michael, thanks for the response. I had come to the same conclusion following the links Kevin provided.
We run a small core facility and use our own centralized hardware, so not switching versions once we've started an analysis presents something of a management challenge. (Basically, we always have several ongoing projects, so we'd never get a chance to upgrade.) I've been thinking about virtualization/containerization for the environment for each individual project for a while, so perhaps this is the final incentive I need to start to organize things along those lines.
Or you could use modules. This allows one to keep multiple versions of software around and use what you need as you need it.
That works quite well in my experience of shared environments - I've been on both sides, as admin and as end user. As admin, you basically install all versions of the program of interest, and the end user can then choose the one they want to load into their working environment.
At four institutes I've worked at there were multiple R versions on the cluster available via e.g. 'module load r-3.4.1'
This solves most issues as then users can have Bioconductor packages in personal library locations which won't change version over time (as it's linked to the R version), unless the user chooses.
I don't know how hard it is on the system administrators to keep the system backward compatible, though.
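As a concrete illustration of the workflow described above (module and path names are hypothetical and vary by site):

```shell
# Load the R version the project was started with
module load r-3.2.1

# Point R at a per-project library so package versions stay frozen;
# each R version gets its own library, so upgrades never overwrite it
export R_LIBS_USER="$HOME/projects/rnaseq-2016/Rlib"

R -e 'sessionInfo()'
```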
Do you have a link that references those changes anywhere, btw?
Yep, see comment above!