I have been using the DESeq VST method on gene counts produced by htseq-count as follows:
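library(DESeq)
# dat: raw htseq-count gene counts (genes x samples); conditions: one label per sample
cds <- newCountDataSet(countData = dat, conditions = factor(conditions))
# Estimate per-sample size factors to correct for differences in library size
cds <- estimateSizeFactors(cds)
# Estimate dispersions; the VST is built from the fitted dispersion-mean relationship
cds <- estimateDispersions(cds, sharingMode = "gene-est-only", method = "pooled", fitType = "local")
vst <- getVarianceStabilizedData(cds)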
But honestly, I do not understand what exactly the getVarianceStabilizedData() function does. Can someone explain in simple terms:
- Why is it necessary to normalize raw count data? Why can't we use the raw count data?
- How exactly are we normalizing the raw count data using the getVarianceStabilizedData() function?
- Should the conditions parameter in the newCountDataSet() function match the conditions between which you want to find differentially expressed genes? For example, I have both cases & controls as well as males & females. So should I include both pieces of information in the conditions parameter, or just cases & controls?
I know these questions can be searched for easily, and I did. But I want a simple explanation from someone who uses these methods regularly, to clarify the concepts for me.
For question 1, do you mean in the sense of variance stabilization or in the sense of library size? Also, have you read the DESeq paper (and the DESeq2 preprint, since you should switch to DESeq2 if possible)?
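To make the "library size" half of that distinction concrete, here is a minimal base-R sketch of median-of-ratios size factors, which is my understanding of what estimateSizeFactors() computes. The toy matrix and every variable name are made up for illustration, and this is not DESeq's actual code. Genes with a zero count would need to be excluded first; the toy data has none.

```r
# Hypothetical toy counts: 4 genes (rows) x 3 samples (columns).
# By construction, s2 is sequenced twice as deeply as s1 and s3 half as deeply.
counts <- matrix(
  c( 10,  20,  5,
    100, 200, 50,
     40,  80, 20,
      8,  16,  4),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), c("s1", "s2", "s3"))
)

# Log of each gene's geometric mean across samples (a reference pseudo-sample).
log_geo_means <- rowMeans(log(counts))

# Size factor per sample: median over genes of (count / geometric mean).
size_factors <- apply(counts, 2, function(col) {
  exp(median(log(col) - log_geo_means))
})

# Dividing each column by its size factor puts samples on a common scale.
normalized <- sweep(counts, 2, size_factors, "/")

print(size_factors)
print(normalized)
```

After the division, each gene has the same value in all three samples, because the only difference between them was sequencing depth. With raw counts, that depth difference would look like a two-fold expression change in every gene.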
I meant in terms of both the stabilization & library size. I did not read the published paper but did read the Reference Manual, and there is a paragraph explaining VST, but it uses statistical terms that I do not quite understand (like a gene's dispersion, Poisson noise, etc.). But I will look at the DESeq paper now that you have mentioned it. Thanks!
If you're familiar with terms like "variance" or "standard deviation" as well as what a Poisson distribution is, then at least those terms can be translated to something you're more familiar with. If not, then you'd be well served to just take a decent statistics class, since a lot of things will be pretty tough going otherwise.
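As a rough illustration of why variance needs "stabilizing" at all, here is a small base-R simulation, offered as a sketch rather than as what DESeq does. For Poisson-distributed counts the variance equals the mean, so highly expressed genes are noisier in absolute terms. A square-root transform is the classic variance-stabilizing transformation for pure Poisson data. DESeq's VST is instead derived from its fitted dispersion-mean relationship, so the square root here is only a stand-in for the idea. The means and the sample size are arbitrary choices.

```r
set.seed(1)

# Three hypothetical genes with low, medium, and high mean expression.
means <- c(5, 50, 500)

# Simulate 10,000 Poisson counts per gene (one column per gene).
sims <- sapply(means, function(m) rpois(10000, lambda = m))

# On the raw scale, the variance grows with the mean.
raw_var <- apply(sims, 2, var)

# After a square-root transform, the variance is roughly constant.
sqrt_var <- apply(sqrt(sims), 2, var)

print(round(raw_var, 2))
print(round(sqrt_var, 3))
```

The raw variances should track the means closely, while the transformed variances should all sit near 0.25 regardless of expression level. That mean-independence is the property a variance-stabilizing transformation aims for, and it is what makes the data better suited to methods such as clustering or PCA that treat all genes on one scale.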