Question

Interesting Math Problem: How Do You "Un"-Sizefactor The Normalized Counts In Deseq?

2


11.3 years ago

lwc628 ▴ 230

Let's say I have DESeq results like this:

baseMeanA  baseMeanB
3479.11063850396        20.4138377244996
9918.86007582945        369.393254062374
1209.76762592568        59.2973381521179
948.474278149218        9009.30704907915
1341.44301378154        127.34346390045
2394.84611662839        326.621403591993
769.478047782662        4793.36351521464
2817.64755732181        427.718504703801
1808.47915508278        266.351977929185
   ...                        ....

This was generated from

baseMeanA = the list of integer counts in A  /  sizeFactorA
baseMeanB = the list of integer counts in B / sizeFactorB

Given only the list of normalized values, is there a way to infer (reverse engineer) the sizeFactorA and sizeFactorB that were used to produce them?

My goal is to recover raw counts from this.

deseq rna-seq expression • 2.9k views

updated 11.3 years ago by matted 7.8k • written 11.3 years ago by lwc628 ▴ 230

0


Are you interested in this in an academic sense? I mean -- are you just trying to find a clever way to solve the riddle, or are you constrained by not having the original data? I ask because, if you still have the original data, you can simply call the sizeFactors() function on your original CountDataSet (DESeq) or DESeqDataSet (DESeq2) to get these numbers.

11.3 years ago by Steve Lianoglou 5.2k

score 2 · Answer 1 · 2013-11-15

That's a fun problem. It amounts to finding an approximate greatest common divisor: every normalized value is an integer multiple of 1/sizeFactor. The non-integer aspect means you can't do too many tricks mathematically (I think).

In practice, you can probably just sort them and look at the smallest two or three values, which should correspond to raw counts of 0, 1, or 2. A raw count of 1 normalizes to exactly 1/sizeFactor, so the factor falls out easily.
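The sorting trick can be sketched roughly as follows. This is only an illustration under assumptions: the function name `guess_size_factor`, the tolerance, and the toy data are made up here, and it only works if the full list really contains very small raw counts (the nine rows shown in the question do not seem to, so the example uses invented counts).

```python
def guess_size_factor(normalized, max_small_count=3, tol=1e-6):
    """Guess a size factor, assuming the smallest positive normalized
    value comes from a raw count of 1, 2, ... up to max_small_count."""
    positive = sorted(v for v in normalized if v > 0)
    if not positive:
        raise ValueError("need at least one positive normalized value")
    smallest = positive[0]
    for k in range(1, max_small_count + 1):
        sf = k / smallest  # because smallest = k / sf
        # Accept the first candidate that turns every value into a near-integer.
        if all(abs(v * sf - round(v * sf)) <= tol for v in positive):
            return sf
    return None  # assumption violated: smallest value is not a tiny count


# Toy data (hypothetical): raw counts divided by a known factor of 1.6.
toy_counts = [0, 1, 2, 5, 17, 130]
toy_normalized = [c / 1.6 for c in toy_counts]
toy_guess = guess_size_factor(toy_normalized)
```

Trying k = 1, 2, 3 in order covers the case where the smallest observed raw count is 2 or 3 rather than 1.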

You can also write code to try a range of normalization values and see which ones give almost-integral answers.
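A minimal sketch of that kind of scan, not the exact code used for the numbers in this answer: the helper names `max_integer_gap` and `scan_size_factors`, the search range, and the step size are assumptions for illustration, and the toy data is invented.

```python
def max_integer_gap(normalized, sf):
    """Largest distance from the nearest integer after multiplying by sf."""
    return max(abs(v * sf - round(v * sf)) for v in normalized)


def scan_size_factors(normalized, lo, hi, step):
    """Try sf = lo, lo + step, ..., hi and return the (sf, gap) pair
    whose rescaled values are closest to integers."""
    n_steps = int(round((hi - lo) / step))
    best_sf, best_gap = None, float("inf")
    for i in range(n_steps + 1):
        sf = lo + i * step
        gap = max_integer_gap(normalized, sf)
        if gap < best_gap:
            best_sf, best_gap = sf, gap
    return best_sf, best_gap


# Toy data (hypothetical): raw counts divided by a known factor of 1.25.
toy_counts = [21, 380, 61, 9268]
toy_normalized = [c / 1.25 for c in toy_counts]
found_sf, found_gap = scan_size_factors(toy_normalized, lo=1.0, hi=1.5, step=0.001)
```

The step size matters: an error of d in the factor shifts a rescaled value v by about v * d, so values in the thousands need a step of roughly 1e-6 or finer before the near-integers show up. Integer multiples of a valid factor also pass the test, so the range should be chosen with that in mind.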

I did that and got 1.458132 for the left column and 1.028714 for the right. That gives raw counts of:

5073.003    21.000
14463.007    380.000
1764.001    61.000
1383.001    9268.000
1956.001    131.000
3492.002    336.000
1122.001    4931.000
4108.502    440.000
2637.001    274.000

I guess I should double the left normalization factor (to 2.916264) to get rid of the half in one of the numbers (4108.502).
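As a quick check of the doubling idea, here is a sketch that rescales the nine baseMeanA values from the question by 2 × 1.458132 and rounds. The helper name `recover_counts` is made up for this example, and the factor is only as precise as the six decimals quoted above.

```python
# baseMeanA values copied from the question.
base_mean_a = [
    3479.11063850396, 9918.86007582945, 1209.76762592568,
    948.474278149218, 1341.44301378154, 2394.84611662839,
    769.478047782662, 2817.64755732181, 1808.47915508278,
]

sf_a = 2 * 1.458132  # doubled left-column factor


def recover_counts(normalized, sf):
    """Multiply by sf, round to integers, and report the worst rounding gap."""
    scaled = [v * sf for v in normalized]
    counts = [round(x) for x in scaled]
    worst_gap = max(abs(x - c) for x, c in zip(scaled, counts))
    return counts, worst_gap


counts_a, worst_gap_a = recover_counts(base_mean_a, sf_a)
print(counts_a, worst_gap_a)
```

A small worst-case gap only says the factor is consistent with integer counts. With just a handful of rows, near-integrality fixes the factor only up to a common multiple or divisor of the recovered counts, so checking against more rows is worthwhile.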