The R code that computes size factors in DESeq can be viewed by:
library(DESeq)
estimateSizeFactorsForMatrix
function (counts, locfunc = median)
{
loggeomeans <- rowMeans(log(counts))
apply(counts, 2, function(cnts) exp(locfunc((log(cnts) -
loggeomeans)[is.finite(loggeomeans)])))
}
<environment: namespace:DESeq>
You can see how it works in the following example, using crudely simulated RNA-Seq data:
library(DESeq)
set.seed(31415926)
true_size_factors = c(1, 1.2, 1.5, 2.0)
true_mean_counts = c(10, 20, 100, 500, 1000, 5000)
e = expand.grid(true_size_factors, true_mean_counts)
lambdas = e$Var1 * e$Var2
raw_data = matrix(rpois(n=nrow(e), lambda=lambdas), ncol=4, byrow=TRUE)
rownames(raw_data) = paste("gene", 1:nrow(raw_data), sep="_")
colnames(raw_data) = paste("sample", 1:ncol(raw_data), sep="_")
raw_data
sizes = estimateSizeFactorsForMatrix(raw_data)
sizes
raw_data / do.call(rbind, rep(list(sizes), 6))
Compare this to the more automated method described in the DESeq vignette:
cds = newCountDataSet(raw_data, conditions=c("trt", "trt", "cont", "cont"))
cds = estimateSizeFactors(cds)
cts = counts(cds, normalized=TRUE)
cts
I would also add that the variance stabilizing transformation provided by DESeq is very useful. The Authors recommend using it for clustering, heatmaps, or other visualization.
cds = estimateDispersions(cds, method="blind")
vsd = getVarianceStabilizedData(cds)
vsd
Hello! Thanks for the comments. Must you eliminate those probes that are always zero in all your samples before doing the normalization or it is not necessary?
Please do not add an answer unless you're answering the top level question. I'm moving this to a comment, but please be more careful in the future.