Question

Why are DESeq2 size factors not null in case of sparse count matrix?

0

Entering edit mode

6.8 years ago

holgerbrandl ▴ 30

Hi,

I've followed the vignette to just perform size factorn normalization using

DESeqDataSetFromMatrix(countData = countMatrix, colData = colData, design = ~ 1)
dds = estimateSizeFactors(dds)
sizeFactors(dds)

I've read the docs of DESeq2::estimateSizeFactors but I still struggle how size factors are calculated by DESeq2. The default type seems to be "ratio" but if so, I would expect many size-factors to become 0 if my count matrix is sparse. However, size factors calculated by DESeq2 seem to be never 0 even on very sparse data.

Is it internally using an altered implementation of the geometric-mean to exclude zeros when running with "ratio", or is it applying some offset addition and/or filtering before calculating the median of the ratios?

To understand it better I've created a clean-room reimplemntaiton of the normalization:

require(tidyverse)

data <- tribble(
~ ensembl_gene_id, ~ sample_A, ~ sample_B, ~ sample_C,
"gene_1", 50, 500, 0,
"gene_2", 30, 0, 10,
"gene_3", 0, 0, 0,
"gene_4", 0, 1000, 0,
"gene_5", 40, 0, 0,
"gene_6", 100, 0, 100,
)

dataLong = gather(data, sample, num_algn, -ensembl_gene_id)

gm_mean = function(x, na.rm=TRUE){ exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x)) }
# gm_mean = function(x, na.rm=TRUE){ exp(sum(log(x+0.5), na.rm=na.rm) / length(x)) }

size_factors = dataLong %>%
    group_by(ensembl_gene_id) %>%
    mutate(gm_norm_algn=num_algn/gm_mean(num_algn)) %>%
    group_by(sample) %>%
    summarize(size_fac=median(gm_norm_algn))


norm_data = dataLong %>% left_join(size_factors)  %>%
    mutate(size_norm=num_algn/size_fac) %>%
    select(ensembl_gene_id, sample, size_norm)

But using it I get entirely different results caused by diminishing size factors.

Thanks & best regards.

Holger

RNA-Seq deseq deseq2 • 3.8k views

ADD COMMENT • link updated 6.8 years ago by Kevin Blighe 88k • written 6.8 years ago by holgerbrandl ▴ 30

1

Entering edit mode

I don't really understand what you mean by saying that many size factors should be 0. Size factors are computed per sample, not per gene. To answer your question, I think DESeq2 only uses genes that have > 0 counts in each sample to compute the size factors, If I remember correctly.

ADD REPLY • link 6.8 years ago by Martombo ★ 3.1k

0

Entering edit mode

Yes, the size factors can never be 0 because they are used to divide the read counts by genes. By the way, size factors are distributed around 1 and the mean of all size factors in your study is equal to 1.

ADD REPLY • link 6.8 years ago by Carlo Yague 8.9k

0

Entering edit mode

Indeed, when excluding all 0 counts from the data, I get factors that come very close to what DESeq2 is providing. Somehow this tiny but important details seems to be left out very often when explaining size-factor normalization.

Some examples:

# A tibble: 36 x 3
   sample        size_fac size_fac_deseq
   <chr>            <dbl>          <dbl>
 1 P525L-0_1        1.00           1.03 
 2 P525L-0_2        0.939          0.953
 3 P525L-1+3h_1     0.949          0.927
 4 P525L-1+3h_2     1.27           1.28 
 5 P525L-1h_1       0.962          0.970
 6 P525L-1h_2       1.09           1.11 
 7 P525L-2h30m_1    0.841          0.825
 8 P525L-2h30m_2    1.14           1.14 
 9 P525L-30m_1      0.899          0.906
10 P525L-30m_2      0.882          0.879
# ... with 26 more rows

As you can see, there are still small differences, but these could be rounding errors.

I'd love to accept your comment as accepted answer if you would post it as answer.

ADD REPLY • link 6.8 years ago by holgerbrandl ▴ 30

0

Entering edit mode

The "ratio" about which you speak, and the size factor, is calculated as follows:

Per sample, divide each gene's value against the geometric mean of its values across all samples.
We then have a ratio for each gene in each sample
For each gene, choose the geometric median of the ratios, which is then the sizing factor

From this, I cannot see how we would end up with zeros for sizing factors.

Note, however, that if you use this method and each of your genes has at least 1 zero across all samples, then an error is thrown.

ADD REPLY • link 6.8 years ago by Kevin Blighe 88k

0

Entering edit mode

Thanks for detailing out the definition again. I've tried to implement exactly this protocol in my code from above. To illustrate the problem better, I've updated the example data to become a sparse count matrix, which results in following size factors

  sample   size_fac
  <chr>       <dbl>
1 sample_A     3.10
2 sample_B     0   
3 sample_C     0

It could well be that my implementation is incorrect, but I can't find what could be wrong.

See @Martombo's comment, which was imho the best pointer so far.

ADD REPLY • link 6.8 years ago by holgerbrandl ▴ 30

score 1 · Accepted Answer · 2018-02-15

1

Entering edit mode

6.8 years ago

Martombo ★ 3.1k

I'm pretty sure that genes with no counts in even only one sample are excluded from the computation. I can't find any reference for this at the moment, but if you look at the code (around line 500), you'll see that only genes with a finite loggeomean are kept (all non-zero counts). Notice also that if "every gene contains at least one zero, cannot compute log geometric means".

ADD COMMENT • link 6.8 years ago by Martombo ★ 3.1k

0

Entering edit mode

Thanks Martombo. If I get a spare couple of hours I will take a look

ADD REPLY • link 6.8 years ago by Kevin Blighe 88k

0

Entering edit mode

Indeed, genes with no counts in even only one sample are excluded. I obtain the exactly same factors as provided by DESeq2 when adopting my implementation from above to include such a filtering step:

size_factors = dataLong %>%
        group_by(ensembl_gene_id) %>%
        filter(!any(num_algn==0)) %>% 
        mutate(gm_norm_algn = num_algn / gm_mean(num_algn)) %>%
        group_by(sample) %>%
        summarize(size_fac = median(gm_norm_algn))

Maybe you could change your comment into answer so that I could accept it?

Personally I find this filtering rather harsh, since in large experimental setups there will often be a sample with a 0 count for any given gene. So the more samples are part of the experiment, the more genes are likely to be excluded when calculating size factors.