Question

R DESeq: What exactly is the Variance Stabilizing Transformation?

10.7 years ago

komal.rathi ★ 4.1k

I have been using the DESeq VST method on gene counts produced by htseq-count as follows:

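# Build the count data set from the raw count matrix (genes x samples)
cds <- newCountDataSet(countData = dat, conditions = factor(conditions))
# Estimate per-sample size factors to account for sequencing depth
cds <- estimateSizeFactors(cds)
# Estimate dispersions, pooling samples across all conditions
cds <- estimateDispersions(cds, sharingMode = "gene-est-only", method = "pooled", fitType = "local")
# Extract the variance-stabilized expression matrix
vst <- getVarianceStabilizedData(cds)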

But honestly, I do not understand what exactly the getVarianceStabilizedData() function does. Can someone explain in simple terms:

1. Why is it necessary to normalize raw count data? Why can't we use the raw count data?
2. How exactly does the getVarianceStabilizedData() function normalize the raw count data?
3. Should the conditions parameter in the newCountDataSet() function match the conditions between which you want to find differentially expressed genes? For example, I have both cases and controls as well as males and females. Should I include both pieces of information in the conditions parameter, or just cases and controls?

I know these questions can be searched for easily, and I did search. But I want a simple explanation from someone who uses these methods regularly, to help me understand the concepts.

R DESeq VST • 31k views

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.7 years ago by komal.rathi ★ 4.1k

For question 1, do you mean in the sense of variance stabilization or in the sense of library size? Also, have you read the DESeq paper (and the DESeq2 preprint, since you should switch to DESeq2 if possible)?

ADD REPLY • link 10.7 years ago by Devon Ryan 105k

I meant in terms of both stabilization and library size. I have not read the published paper, but I did read the Reference Manual. It has a paragraph explaining VST, but it uses statistical terms that I do not quite understand (like a gene's dispersion, Poisson noise, etc.). I will look at the DESeq paper now that you have mentioned it. Thanks!

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.7 years ago by komal.rathi ★ 4.1k

If you're familiar with terms like "variance" or "standard deviation" as well as what a Poisson distribution is, then at least those terms can be translated to something you're more familiar with. If not, then you'd be well served to just take a decent statistics class, since a lot of things will be pretty tough going otherwise.

ADD REPLY • link 10.7 years ago by Devon Ryan 105k

Ram · Accepted Answer · 2014-08-11

The vignettes in DESeq2 (which you should prefer using these days) describe these things and why you'd want to use them, under these sections:

http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf under "Data transformations and visualization"; and
http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/beginner.pdf under "Working with rlog-transformed data"

The main image you want to have in your mind is the one in Figure 3 of the first link. The point of these transforms is to reduce (ideally eliminate) dependence of the variance on the mean. The second link above has this paragraph which sums things up quite nicely:

Many common statistical methods for exploratory analysis of multidimensional data, especially methods for clustering and ordination (e.g., principal-component analysis and the like), work best for (at least approximately) homoskedastic data; this means that the variance of an observable quantity (i.e., here, the expression strength of a gene) does not depend on the mean. In RNA-Seq data, however, variance grows with the mean.
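To make the mean-variance dependence concrete, here is a minimal base-R sketch on simulated data. It is not what DESeq or DESeq2 do internally (they estimate a dispersion-mean trend from the data). It assumes negative binomial counts with a single, known dispersion `alpha` (a made-up value), for which a textbook variance-stabilizing transform is `(2 / sqrt(alpha)) * asinh(sqrt(alpha * x))`. The helper name `vst_nb` and all parameter values are illustrative assumptions.

```r
set.seed(42)

n_genes   <- 500
n_samples <- 50
alpha     <- 0.1                           # assumed common dispersion (hypothetical)
mu        <- 10^runif(n_genes, 1, 4)       # true gene means between 10 and 10,000

# Negative binomial counts with Var = mu + alpha * mu^2 (size = 1 / dispersion).
# rep(mu, n_samples) plus column-wise filling puts genes in rows, samples in columns.
counts <- matrix(
  rnbinom(n_genes * n_samples, mu = rep(mu, n_samples), size = 1 / alpha),
  nrow = n_genes
)

# Textbook variance-stabilizing transform for a negative binomial
# with known dispersion (an illustration, not DESeq's fitted transform)
vst_nb <- function(x, alpha) (2 / sqrt(alpha)) * asinh(sqrt(alpha * x))

gene_mean <- rowMeans(counts)
raw_sd    <- apply(counts, 1, sd)
vst_sd    <- apply(vst_nb(counts, alpha), 1, sd)

# Rank correlation between a gene's mean and its standard deviation:
# strong on the raw scale, weak after the transform
cor_raw <- cor(gene_mean, raw_sd, method = "spearman")
cor_vst <- cor(gene_mean, vst_sd, method = "spearman")
print(c(raw = cor_raw, vst = cor_vst))
```

On the raw scale, the highly expressed genes have by far the largest standard deviations, so they would dominate any distance-based method. After the transform, the per-gene standard deviations are roughly constant across the range of means, which is the homoskedasticity the quoted paragraph refers to.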

You should use these transformations when your downstream analysis of the count data does not involve testing for differential expression with the statistical methods developed for count data. Such scenarios include clustering, PCA over your expression data, or using the data as input to another machine learning algorithm. For differential expression testing itself, give the raw counts to DESeq/DESeq2, not the transformed values.
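As a rough sketch of that kind of exploratory workflow in DESeq2, the code below transforms simulated counts and then applies ordinary clustering and PCA. It assumes DESeq2 is installed (the block does nothing otherwise) and uses the package's `makeExampleDESeqDataSet()` simulator in place of real data. The variable names are illustrative.

```r
have_deseq2 <- requireNamespace("DESeq2", quietly = TRUE)

if (have_deseq2) {
  suppressPackageStartupMessages(library(DESeq2))
  set.seed(1)

  # Simulated data: 1000 genes x 6 samples with a two-level `condition` factor
  dds <- makeExampleDESeqDataSet(n = 1000, m = 6)

  # blind = TRUE re-estimates dispersions with an intercept-only design, so the
  # transform does not use the sample groups; this suits unsupervised exploration
  vsd     <- varianceStabilizingTransformation(dds, blind = TRUE)
  vst_mat <- assay(vsd)              # genes x samples matrix of transformed values

  # Standard tools for continuous data can now be applied to the samples
  sample_dist <- dist(t(vst_mat))    # Euclidean distances between samples
  hc          <- hclust(sample_dist) # hierarchical clustering of samples
  pca         <- prcomp(t(vst_mat))  # PCA with samples as observations
  print(summary(pca)$importance[, 1:2])

  # DESeq2 also provides a PCA plot directly on the transformed object:
  # plotPCA(vsd, intgroup = "condition")
}
```

With real data, the object would instead come from `DESeqDataSetFromMatrix()` with your own count matrix and sample table. `rlog()` can be used in place of `varianceStabilizingTransformation()` in the same way.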

Take the time to read the two vignettes above, as well as the DESeq2 preprint to get a better understanding of this (and many other things related to differential expression analysis with this software) ... the authors have gone to great lengths to document their software and methodology quite thoroughly.