Question

Clarification on how the DESeq2 Dispersion Curve is Generated

7.0 years ago

brismiller ▴ 60

Hi everyone,

I have a clarification question on how the average expression versus dispersion curve is generated. The paper says that DESeq2 uses 'all samples' in making the plot, but is that all samples for a given sample type (genotype), or all samples regardless of genotype?

I am worried that gene dispersion information is being shared between genotypes, and I am wondering if this is valid. I understand that DESeq2 uses the correlation between average gene expression and dispersion for dispersion shrinkage, but does this assumption hold true between genotypes?

Quote from DESeq2 paper:

"Our DESeq method [4] detects and corrects dispersion estimates that are too low through modeling of the dependence of the dispersion on the average expression strength over all samples." (DESeq2 paper)

RNA-Seq Deseq2 dispersion gene correction • 2.8k views

updated 7.0 years ago by Kevin Blighe 89k • written 7.0 years ago by brismiller ▴ 60

score 2 · Accepted Answer · 2018-04-24

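Yes, from what I understand, DESeq2 does not fit group-specific dispersion estimates, i.e., a single dispersion is estimated for each gene using all samples, regardless of genotype. The design does still enter the calculation: the dispersion describes the variability of the counts around the fitted means from your design model, so a genuine expression difference between genotypes is not itself counted as dispersion. The trend in the dispersion plot is then fit against the mean of normalised counts over all samples. In very large datasets, it may be more intuitive to calculate dispersion within your groups of interest and apply weightings, whilst, for smaller datasets, trying to do this would leave too few replicates per group for a stable estimate and, it follows, could distort your statistical interpretations from the data.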
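To make the "one dispersion per gene, shared across genotypes" idea concrete, here is a minimal Python sketch using made-up counts for a single gene. It is not DESeq2's code: `moment_dispersion` is a hypothetical helper that uses a rough method-of-moments estimate instead of the maximum-likelihood fit, and it assumes the counts are already normalised. As a further assumption of the sketch, residuals are taken around each genotype's own mean, so a real genotype effect is not counted as noise.

```python
from statistics import mean

def moment_dispersion(counts, groups):
    """Rough moment-based dispersion for one gene (illustrative only).

    counts: normalised counts, one per sample
    groups: group label for each sample (needs replicates in every group)
    """
    labels = sorted(set(groups))
    # Fitted mean for each sample = mean of its own group
    group_mean = {
        g: mean(c for c, gg in zip(counts, groups) if gg == g) for g in labels
    }
    total = 0.0
    for c, g in zip(counts, groups):
        mu = group_mean[g]
        # Squared residual minus the Poisson part, on the mean^2 scale
        total += ((c - mu) ** 2 - mu) / mu ** 2
    dof = len(counts) - len(labels)  # samples minus fitted group means
    return max(total / dof, 0.0)     # a dispersion cannot be negative

counts = [100, 110, 90, 200, 260, 140]          # hypothetical gene
genotype = ["WT", "WT", "WT", "KO", "KO", "KO"]

shared = moment_dispersion(counts, genotype)          # all six samples
wt_only = moment_dispersion(counts[:3], genotype[:3])  # three samples
ko_only = moment_dispersion(counts[3:], genotype[3:])  # three samples
print(shared, wt_only, ko_only)
```

With only three replicates per genotype, the two group-specific values differ a lot from each other, while the shared estimate uses all six samples. That instability is the practical argument for one dispersion per gene in small experiments.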

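For genes with high counts, the dispersion is approximately:

variance / mean^2

...which is the same as CoV^2 (squared coefficient of variation). More precisely, the negative binomial model has variance = mean + dispersion * mean^2, so the Poisson part of the variance (equal to the mean) is not included in the dispersion. See here: https://support.bioconductor.org/p/88880/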
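As a rough numerical check of the relationship above, the sketch below compares variance / mean^2 with a moment estimate that assumes the negative binomial variance, mean + dispersion * mean^2. The counts are made up and both function names are hypothetical. DESeq2 itself estimates dispersion by maximum likelihood, so this only illustrates the scale of the quantities.

```python
from statistics import mean, variance

def cv_squared(counts):
    """Squared coefficient of variation: variance / mean^2."""
    return variance(counts) / mean(counts) ** 2

def nb_moment_dispersion(counts):
    """Moment estimate assuming variance = mean + dispersion * mean^2."""
    m = mean(counts)
    return max((variance(counts) - m) / m ** 2, 0.0)

high = [900, 1100, 1000, 1200, 800]  # hypothetical high-count gene
low = [3, 7, 5, 8, 2]                # hypothetical low-count gene

for name, counts in (("high", high), ("low", low)):
    print(name, cv_squared(counts), nb_moment_dispersion(counts))
```

For the high-count gene the two numbers are nearly identical (about 0.025 versus 0.024). For the low-count gene they differ widely (about 0.26 versus 0.06), because most of the observed variance there is Poisson sampling noise, not dispersion.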

I have my own summary of how DESeq2 models dispersion:

Part I

1. Calculate the maximum-likelihood estimate (MLE) of dispersion for each gene in the dataset (black dots).
2. Model the MLEs (red curve).
3. From the model curve fit in 2, predict a value for each gene.

Part II

1. Fit an empirical Bayes regression model to the MLEs and use the predicted values from the model curve fit in Part I, Step 3 (above) as the prior means for each gene in the model. In empirical Bayesian statistics, by supplying 'priors' to the model, one is saying that these priors are the measured / empirical values and that we want to 'shrink' our current data to match the distribution of these priors.
2. Predict values from this model (blue dots) - these are the final dispersion estimates. What happens is that genes with lower counts have higher dispersion and are 'shrunk' more toward the red line than genes with higher counts, which have lower dispersion.
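The two parts above can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not DESeq2's algorithm. The gene-wise estimates are made-up numbers. The trend is fit by ordinary least squares to the form a0 + a1 / mean. The shrinkage is a plain normal-normal posterior mean on the log scale, with `prior_var` and `sampling_var` as hypothetical fixed values that DESeq2 would instead derive from the data. Because `sampling_var` is the same for every gene here, the sketch does not reproduce the stronger shrinkage of low-count genes.

```python
import math

def fit_trend(means, disps):
    """Least-squares fit of disp ~ a0 + a1 / mean (simplified trend fit)."""
    xs = [1.0 / m for m in means]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(disps) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, disps))
    a1 = sxy / sxx
    a0 = y_bar - a1 * x_bar
    return a0, a1

def shrink(gene_disp, trend_disp, prior_var, sampling_var):
    """Precision-weighted average of log gene-wise and log trend values."""
    w_data = 1.0 / sampling_var
    w_prior = 1.0 / prior_var
    log_post = (
        w_data * math.log(gene_disp) + w_prior * math.log(trend_disp)
    ) / (w_data + w_prior)
    return math.exp(log_post)

# Part I, steps 1-3: gene-wise estimates (black dots), trend (red curve)
means = [10, 50, 100, 500, 1000, 5000]            # hypothetical mean counts
gene_wise = [0.60, 0.09, 0.12, 0.04, 0.07, 0.05]  # hypothetical estimates
a0, a1 = fit_trend(means, gene_wise)
trend = [max(a0 + a1 / m, 1e-8) for m in means]   # keep values positive

# Part II: shrink each gene-wise value toward its trend value (blue dots)
final = [
    shrink(g, t, prior_var=0.25, sampling_var=0.5)  # assumed variances
    for g, t in zip(gene_wise, trend)
]

for m, g, t, f in zip(means, gene_wise, trend, final):
    print(f"mean={m:5d} gene-wise={g:.3f} trend={t:.3f} final={f:.3f}")
```

Each final value lands between the gene-wise estimate and the trend value. How close it sits to the trend is set by the ratio of the two variances, which is the quantity an empirical Bayes approach estimates from the data.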

That's my take, anyway. Also see that of the developer on this subject:

Kevin