I analyzed an RNA-Seq experiment (from human samples) a little over a year ago for a colleague, using DESeq2. The analysis produced some good results (96 genes upregulated and 263 downregulated, using default settings) which made good biological sense.
The investigator wanted to add a couple more samples since the sample size was fairly small, so recently she sequenced two more pairs of samples and asked me to run the same analysis with the full set. As you can guess, all the nice results vanished.
To check the analysis, I omitted the new samples and re-ran the same analysis I had run previously, on the same samples, and got completely different results: 9 genes upregulated and 12 downregulated.
To make sure I was doing exactly the same thing both times, I loaded up the saved RData file from last year's analysis. This has the original summarized experiment (se), the DESeqDataSet (dds) after running DESeq(), and the results. I ran exactly the same commands I had run previously, creating a new DESeqDataSet from the same summarized experiment as follows:
dds_new <- DESeqDataSet(se, design = design(dds))
# I know this is a left-over habit from the original DESeq days, which I shouldn't still be using, however, I did it a year ago:
dds_new <- dds_new[rowSums(counts(dds_new)) > 1,]
dds_new <- DESeq(dds_new)
and sure enough I have different results:
> summary(results(dds))
out of 35036 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 96, 0.27%
LFC < 0 (down) : 263, 0.75%
outliers [1] : 0, 0%
low counts [2] : 0, 0%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
> summary(results(dds_new))
out of 35036 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 9, 0.026%
LFC < 0 (down) : 12, 0.034%
outliers [1] : 0, 0%
low counts [2] : 0, 0%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
I did some further checks, and the difference is in the dispersion estimates:
> summary(counts(dds)-counts(dds_new))
V1 V2 V3 V4 V5 V6
Min. :0 Min. :0 Min. :0 Min. :0 Min. :0 Min. :0
1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0
Median :0 Median :0 Median :0 Median :0 Median :0 Median :0
Mean :0 Mean :0 Mean :0 Mean :0 Mean :0 Mean :0
3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
Max. :0 Max. :0 Max. :0 Max. :0 Max. :0 Max. :0
V7 V8 V9 V10 V11 V12
Min. :0 Min. :0 Min. :0 Min. :0 Min. :0 Min. :0
1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0
Median :0 Median :0 Median :0 Median :0 Median :0 Median :0
Mean :0 Mean :0 Mean :0 Mean :0 Mean :0 Mean :0
3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
Max. :0 Max. :0 Max. :0 Max. :0 Max. :0 Max. :0
V13 V14 V15 V16 V17
Min. :0 Min. :0 Min. :0 Min. :0 Min. :0
1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0
Median :0 Median :0 Median :0 Median :0 Median :0
Mean :0 Mean :0 Mean :0 Mean :0 Mean :0
3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
Max. :0 Max. :0 Max. :0 Max. :0 Max. :0
> summary(counts(dds,normalized=T)-counts(dds_new, normalized=T))
V1 V2 V3 V4 V5 V6
Min. :0 Min. :0 Min. :0 Min. :0 Min. :0 Min. :0
1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0
Median :0 Median :0 Median :0 Median :0 Median :0 Median :0
Mean :0 Mean :0 Mean :0 Mean :0 Mean :0 Mean :0
3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
Max. :0 Max. :0 Max. :0 Max. :0 Max. :0 Max. :0
V7 V8 V9 V10 V11 V12
Min. :0 Min. :0 Min. :0 Min. :0 Min. :0 Min. :0
1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0
Median :0 Median :0 Median :0 Median :0 Median :0 Median :0
Mean :0 Mean :0 Mean :0 Mean :0 Mean :0 Mean :0
3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
Max. :0 Max. :0 Max. :0 Max. :0 Max. :0 Max. :0
V13 V14 V15 V16 V17
Min. :0 Min. :0 Min. :0 Min. :0 Min. :0
1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0
Median :0 Median :0 Median :0 Median :0 Median :0
Mean :0 Mean :0 Mean :0 Mean :0 Mean :0
3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
Max. :0 Max. :0 Max. :0 Max. :0 Max. :0
> sizeFactors(dds)-sizeFactors(dds_new)
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> summary(dispersions(dds)-dispersions(dds_new))
Min. 1st Qu. Median Mean 3rd Qu. Max.
-16.67847 -0.49073 -0.03083 -1.06727 0.00000 8.96455
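For anyone wanting to dig one level deeper, it can help to separate the gene-wise dispersion estimates (which depend only on the counts and size factors) from the fitted/shrunken values (which depend on the fitting code that changed between versions). A minimal sketch, assuming both objects (dds and dds_new, as above) are still in the session:

```r
library(DESeq2)

# Gene-wise estimates, before trend fitting and shrinkage
disp_gene_old <- mcols(dds)$dispGeneEst
disp_gene_new <- mcols(dds_new)$dispGeneEst

# If these agree but dispersions() differ, the change is in the
# trend fitting / shrinkage step rather than in the raw estimates
summary(disp_gene_old - disp_gene_new)

# Visual comparison of the two dispersion fits
par(mfrow = c(1, 2))
plotDispEsts(dds,     main = "old fit (DESeq2 1.10.1)")
plotDispEsts(dds_new, main = "new fit (DESeq2 1.14.1)")
```

This is only a diagnostic sketch; the column names come from the standard mcols() output of a DESeqDataSet after running DESeq().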
From my log files (which I am pretty certain were generated when I ran this a year ago), I have the following sessionInfo() output:
R version 3.2.1 (2015-06-18)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS release 6.6 (Final)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 methods stats graphics grDevices utils
[8] datasets base
other attached packages:
[1] org.Hs.eg.db_3.2.3 RSQLite_1.0.0
[3] DBI_0.4 RColorBrewer_1.1-2
[5] pheatmap_1.0.8 DESeq2_1.10.1
[7] RcppArmadillo_0.5.500.2.0 Rcpp_0.12.4
[9] GenomicAlignments_1.6.3 SummarizedExperiment_1.0.2
[11] GenomicFeatures_1.22.13 AnnotationDbi_1.32.3
[13] Biobase_2.30.0 Rsamtools_1.22.0
[15] Biostrings_2.38.4 XVector_0.10.0
[17] GenomicRanges_1.22.4 GenomeInfoDb_1.6.3
[19] IRanges_2.4.8 S4Vectors_0.8.11
[21] BiocGenerics_0.16.1
loaded via a namespace (and not attached):
[1] futile.logger_1.4.1 plyr_1.8.3 bitops_1.0-6
[4] futile.options_1.0.0 tools_3.2.1 zlibbioc_1.16.0
[7] digest_0.6.9 rpart_4.1-10 biomaRt_2.26.1
[10] annotate_1.48.0 gtable_0.2.0 lattice_0.20-33
[13] Matrix_1.2-6 gridExtra_2.2.1 genefilter_1.52.1
[16] cluster_2.0.4 rtracklayer_1.30.4 locfit_1.5-9.1
[19] nnet_7.3-12 grid_3.2.1 data.table_1.9.6
[22] XML_3.98-1.4 survival_2.39-2 BiocParallel_1.4.3
[25] foreign_0.8-66 latticeExtra_0.6-28 Formula_1.2-1
[28] geneplotter_1.48.0 ggplot2_2.1.0 lambda.r_1.1.7
[31] Hmisc_3.17-4 scales_0.4.0 splines_3.2.1
[34] xtable_1.8-2 colorspace_1.2-6 labeling_0.3
[37] acepack_1.3-3.3 RCurl_1.95-4.8 munsell_0.4.3
[40] chron_2.3-47
and from my current session, I have
> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Scientific Linux 7.3 (Nitrogen)
Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] DESeq2_1.14.1 SummarizedExperiment_1.4.0
[3] Biobase_2.34.0 GenomicRanges_1.26.4
[5] GenomeInfoDb_1.10.3 IRanges_2.8.2
[7] S4Vectors_0.12.2 BiocGenerics_0.20.0
loaded via a namespace (and not attached):
[1] genefilter_1.56.0 locfit_1.5-9.1 splines_3.4.1
[4] lattice_0.20-35 colorspace_1.3-2 htmltools_0.3.6
[7] rtracklayer_1.34.2 GenomicFeatures_1.26.4 base64enc_0.1-3
[10] blob_1.1.0 survival_2.41-3 XML_3.98-1.9
[13] rlang_0.1.1 foreign_0.8-69 DBI_0.7
[16] BiocParallel_1.10.1 bit64_0.9-7 RColorBrewer_1.1-2
[19] plyr_1.8.4 stringr_1.2.0 zlibbioc_1.20.0
[22] Biostrings_2.42.1 munsell_0.4.3 gtable_0.2.0
[25] htmlwidgets_0.9 memoise_1.1.0 latticeExtra_0.6-28
[28] knitr_1.16 biomaRt_2.30.0 geneplotter_1.52.0
[31] AnnotationDbi_1.36.2 htmlTable_1.9 Rcpp_0.12.12
[34] acepack_1.4.1 xtable_1.8-2 scales_0.4.1
[37] backports_1.1.0 checkmate_1.8.3 Hmisc_4.0-3
[40] annotate_1.52.1 XVector_0.14.1 bit_1.1-12
[43] Rsamtools_1.26.2 gridExtra_2.2.1 ggplot2_2.2.1
[46] digest_0.6.12 stringi_1.1.5 grid_3.4.1
[49] bitops_1.0-6 tools_3.4.1 magrittr_1.5
[52] lazyeval_0.2.0 RCurl_1.95-4.8 tibble_1.3.3
[55] RSQLite_2.0 Formula_1.2-2 cluster_2.0.6
[58] Matrix_1.2-10 data.table_1.10.4 rpart_4.1-11
[61] GenomicAlignments_1.10.1 nnet_7.3-12 compiler_3.4.1
So my questions are:
- Has anything changed between DESeq2 versions 1.10.1 and 1.14.1 that would cause this (pretty dramatic) change in dispersion estimates?
- Is there anything else I can check in the data I have saved to see what might differ between the two analyses?
You deserve a lot of credit for keeping your previous results well preserved.
@glocke01 Thanks: I am quite picky about that kind of thing. Unfortunately, it doesn't seem to have helped a whole lot in this case...
Tagging: Michael Love
He has not been by for a couple of weeks but he eventually will.
For faster resolution, post this on the Bioconductor support site for DESeq2, as was suggested earlier.
You are more likely to get a response asking at the Bioconductor forum, as the DESeq2 authors are very active there. If/when you cross-post, apologize for cross-posting, mention that you did so at some random dude's suggestion, and link to this thread.
But anyway, it is possible something changed inside DESeq2 that would cause this kind of change; I had a similar experience with edgeR versions.
Yes, I believe that they recently modified how design formulae are handled, but I cannot be sure. What is your design formula, and what type of data does it include?
@KevinBlighe I can't give all the details as it's essentially someone else's project, but there are "before treatment" and "after treatment" samples from two different treatment groups. The design formula is essentially
~ timepoint + treatment + sampleID.treatment + timepoint:treatment
(i.e. I'm interested in the interaction term: I want to know how the treatment affects the before -> after change). The sampleID.treatment term is essentially the ind.n term from the vignette section "Group-specific condition effects, individuals nested within groups"; timepoint and treatment are both two-level factors.
Hey James, I believe that it's the interaction terms that they specifically changed in version 1.14.1, but I honestly cannot be certain. I recall requiring an interaction term in a recent study and noticed the different behaviour of DESeq2 in the design model. I then decided to just manually install version 1.10.1 because I was more familiar with that version.
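For readers trying to reproduce this kind of design, the vignette's nested-individuals trick boils down to re-indexing subjects within each treatment group. A minimal sketch with made-up metadata (the column names here are illustrative, not the poster's actual data):

```r
library(DESeq2)

# Toy metadata: 2 treatment groups x 2 subjects each x before/after
coldata <- data.frame(
  treatment = factor(rep(c("A", "B"), each = 4)),
  timepoint = factor(rep(c("before", "after"), times = 4),
                     levels = c("before", "after")),
  subject   = factor(rep(1:4, each = 2))
)

# Re-number subjects within each treatment group, as in the vignette's
# "Group-specific condition effects, individuals nested within groups"
coldata$sampleID.treatment <- factor(
  ave(as.integer(coldata$subject), coldata$treatment,
      FUN = function(x) match(x, unique(x)))
)

# The design discussed above; timepoint:treatment is the term of interest
model.matrix(~ timepoint + treatment + sampleID.treatment +
               timepoint:treatment, coldata)
```

Inspecting the model matrix like this is a quick way to confirm the design is full rank before handing it to DESeqDataSet().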
You may want to take a look at these threads on the Bioconductor forum: https://support.bioconductor.org/p/83645/ https://support.bioconductor.org/p/73676/#73694
You may want to look at the list of updates to each version in the NEWS file too: https://github.com/Bioconductor-mirror/DESeq2/blob/master/NEWS
That's really helpful, thanks. I think my next step is to try to recreate an environment with 1.10.1 and make sure I can recreate the previous results with that. (If so, we then have to decide what it means that we get one set of results with one version and a different set with a different version. But at least I have an idea what may be happening now, and a way forward.)
I think the relevant change was the stabilization of the estimates of mu_ij that's listed in NEWS between 1.10 and 1.14. This can then have an effect downstream, modifying the dispersion estimates. I always recommend that users not switch versions once they've started an analysis and are wedded in some way to the output, as the software implements a complex and interdependent set of methods.
I know it seems like the numbers change a lot, but it's often a relatively small change to the test statistic that seems big after thresholding at a significance level, which we do when we build FDR sets. The Wald statistics are probably highly correlated here.
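The point about correlated test statistics can be checked directly. A minimal sketch, assuming both objects (dds and dds_new, as above) are in the session and cover the same genes:

```r
library(DESeq2)

res_old <- results(dds)      # results from the old fit
res_new <- results(dds_new)  # results from the new fit

# High correlation here would mean the disagreement is mostly genes
# sitting near the adjusted p-value threshold, not wholesale changes
cor(res_old$stat, res_new$stat, use = "complete.obs")

plot(res_old$stat, res_new$stat,
     xlab = "Wald statistic (old fit)",
     ylab = "Wald statistic (new fit)")
abline(0, 1, col = "red")
```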
I wrote up an explanation of this one time on the support site:
https://support.bioconductor.org/p/73676/#73694
Michael, thanks for the response. I had come to the same conclusion following the links Kevin provided.
We run a small core facility and use our own centralized hardware, so not switching versions once we've started an analysis presents something of a management challenge. (Basically, we always have several ongoing projects, so we'd never get a chance to upgrade.) I've been thinking about virtualization/containerization for the environment for each individual project for a while, so perhaps this is the final incentive I need to start to organize things along those lines.
Or you could use modules. This allows one to keep multiple versions of software around and use what you need as you need it.
That works quite well in my experience of shared environments - I've been on both sides, as admin and as end user. As admin, you basically install all versions of the program of interest, and the end user can then choose the one they want to load into their working environment.
At four institutes I've worked at there were multiple R versions on the cluster available via e.g. 'module load r-3.4.1'
This solves most issues as then users can have Bioconductor packages in personal library locations which won't change version over time (as it's linked to the R version), unless the user chooses.
I don't know how hard it is on the system administrators to keep the system backward compatible, though.
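As a concrete illustration of the workflow described above (module and path names are hypothetical and vary by site):

```shell
# Load the R version the project was started with
module load r-3.2.1

# Point R at a per-project library so package versions stay frozen;
# each R version gets its own library, so upgrades never overwrite it
export R_LIBS_USER="$HOME/projects/rnaseq-2016/Rlib"

R -e 'sessionInfo()'
```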
Do you have a link that references those changes anywhere, btw?
Yep, see comment above!