Hi,
I've followed the vignette to just perform size factorn normalization using
DESeqDataSetFromMatrix(countData = countMatrix, colData = colData, design = ~ 1)
dds = estimateSizeFactors(dds)
sizeFactors(dds)
I've read the docs of DESeq2::estimateSizeFactors
but I still struggle how size factors are calculated by DESeq2. The default type seems to be "ratio" but if so, I would expect many size-factors to become 0 if my count matrix is sparse. However, size factors calculated by DESeq2 seem to be never 0 even on very sparse data.
Is it internally using an altered implementation of the geometric-mean to exclude zeros when running with "ratio", or is it applying some offset addition and/or filtering before calculating the median of the ratios?
To understand it better I've created a clean-room reimplemntaiton of the normalization:
require(tidyverse)
data <- tribble(
~ ensembl_gene_id, ~ sample_A, ~ sample_B, ~ sample_C,
"gene_1", 50, 500, 0,
"gene_2", 30, 0, 10,
"gene_3", 0, 0, 0,
"gene_4", 0, 1000, 0,
"gene_5", 40, 0, 0,
"gene_6", 100, 0, 100,
)
dataLong = gather(data, sample, num_algn, -ensembl_gene_id)
gm_mean = function(x, na.rm=TRUE){ exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x)) }
# gm_mean = function(x, na.rm=TRUE){ exp(sum(log(x+0.5), na.rm=na.rm) / length(x)) }
size_factors = dataLong %>%
group_by(ensembl_gene_id) %>%
mutate(gm_norm_algn=num_algn/gm_mean(num_algn)) %>%
group_by(sample) %>%
summarize(size_fac=median(gm_norm_algn))
norm_data = dataLong %>% left_join(size_factors) %>%
mutate(size_norm=num_algn/size_fac) %>%
select(ensembl_gene_id, sample, size_norm)
But using it I get entirely different results caused by diminishing size factors.
Thanks & best regards.
Holger
I don't really understand what you mean by saying that many size factors should be 0. Size factors are computed per sample, not per gene. To answer your question, I think DESeq2 only uses genes that have > 0 counts in each sample to compute the size factors, If I remember correctly.
Yes, the size factors can never be 0 because they are used to divide the read counts by genes. By the way, size factors are distributed around 1 and the mean of all size factors in your study is equal to 1.
Indeed, when excluding all 0 counts from the data, I get factors that come very close to what DESeq2 is providing. Somehow this tiny but important details seems to be left out very often when explaining size-factor normalization.
Some examples:
As you can see, there are still small differences, but these could be rounding errors.
I'd love to accept your comment as accepted answer if you would post it as answer.
The "ratio" about which you speak, and the size factor, is calculated as follows:
From this, I cannot see how we would end up with zeros for sizing factors.
Note, however, that if you use this method and each of your genes has at least 1 zero across all samples, then an error is thrown.
Thanks for detailing out the definition again. I've tried to implement exactly this protocol in my code from above. To illustrate the problem better, I've updated the example data to become a sparse count matrix, which results in following size factors
It could well be that my implementation is incorrect, but I can't find what could be wrong.
See @Martombo's comment, which was imho the best pointer so far.