Question

Cox proportional hazards regression: should I use log2 FPKM?

5.1 years ago

MatthewP ★ 1.4k

I ran univariate Cox analysis on 100+ genes with the survival package, then applied p.adjust and filtered by adjusted p < 0.05.

First I used the original FPKM-UQ values from GDC. I got a few genes, but all HR values were very close to 1 (1.000000xxxx). Then I used log2(FPKM + 1) and got more genes, with HR values that look normal (clearly different from 1).

This suggests I should use log2 FPKM values, but I can't figure out why the original FPKM values give HRs so close to 1.

cox fpkm • 3.4k views

ADD COMMENT • link updated 9 weeks ago by Charly • 0 • written 5.1 years ago by MatthewP ★ 1.4k

score 4 · Accepted Answer · 2019-11-02

5.1 years ago

dsull ★ 6.9k

First off, do not log FPKMs. An explanation of why not to do so is provided here: (see 25:50 - 29:10).

Second, metrics like upper-quartile-normalized FPKM or TPM (TPM is better than FPKM, by the way) don't fix problems with between-sample comparisons. A better way is to use DESeq2 to normalize the data. DESeq2 has a vst function that normalizes your count data and corrects heteroscedasticity (i.e. corrects for the fact that genes with higher average expression have higher variances) on a log2-like scale. You can run DESeq2 on raw RNA-seq counts (which are obtainable from GDC).

Third (without having played with the actual expression and survival data myself), I don't have a perfect explanation for why your HRs are close to 1, but here are some ideas. Cox regression assumes a linear relationship between the log hazard and your variable (expression). (You can check whether this assumption holds by analyzing the residuals, e.g. martingale residuals.) This is why log2-transformed values would fit much better than raw count-like data (which is otherwise Poisson or negative binomially distributed).
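One scale effect that may be worth checking: a Cox HR is reported per one-unit increase in the covariate, so if the raw FPKM-UQ values span thousands or more, a real association can still show up as an HR of 1.000000xxxx. The sketch below only illustrates the arithmetic. `beta_per_fpkm` is a made-up coefficient (not fitted from any data), and `hazard_ratio` is a hypothetical helper.

```python
import math

def hazard_ratio(beta, delta=1.0):
    """Hazard ratio for a `delta`-unit increase in a covariate with Cox coefficient `beta`."""
    return math.exp(beta * delta)

# Hypothetical coefficient for a gene measured in raw FPKM-UQ units (assumed, not fitted).
beta_per_fpkm = 2e-6

hr_per_unit = hazard_ratio(beta_per_fpkm)            # per +1 FPKM-UQ unit
hr_per_100k = hazard_ratio(beta_per_fpkm, 100_000)   # per +100,000 FPKM-UQ units

print(f"HR per 1 unit:        {hr_per_unit:.8f}")
print(f"HR per 100,000 units: {hr_per_100k:.4f}")
```

On a log2(FPKM + 1) scale, one unit corresponds to a doubling of FPKM + 1, so the reported HR is per doubling and moves visibly away from 1 even though the underlying association is the same kind of effect.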

ADD COMMENT • link 5.1 years ago by dsull ★ 6.9k

I think your definition of heteroscedasticity is off. Based on a typical "mean variance plot" from an RNA-seq experiment, you would see that the higher the mean counts, the lower the variance.

ADD REPLY • link 5.1 years ago by Haci ▴ 730

~~High variance typically comes from low counts, here is an explanation why:~~

Edit: Sorry, I was mixing up high variance with artificially high fold changes, which is what the post below refers to:

A: Volcano plot: why is there big FC with big p-values?

vst 1) normalizes counts using the RLE strategy from DESeq2, 2) transforms them to a log2-like scale, and 3) tries to remove the dependence of the variance on the mean ~~(which is essentially high variance based on small counts).~~

Original text from the source function:

This function calculates a variance stabilizing transformation (VST) from the fitted dispersion-mean relation(s) and then transforms the count data (normalized by division by the size factors or normalization factors), yielding a matrix of values which are now approximately homoskedastic (having constant variance along the range of mean values). The transformation also normalizes with respect to library size.
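To make "dependency of the variance on the mean" concrete, here is a rough delta-method sketch, not DESeq2's actual implementation. It assumes a negative binomial variance of the form mu + alpha * mu^2, with a made-up dispersion `alpha`, and approximates the variance of a plain natural-log transform. Both function names are hypothetical.

```python
def nb_variance(mu, alpha):
    """Negative binomial variance for mean `mu` and dispersion `alpha`."""
    return mu + alpha * mu ** 2

def approx_var_of_log(mu, alpha):
    """Delta-method approximation: Var(log X) ~ Var(X) / mu^2 = 1/mu + alpha."""
    return nb_variance(mu, alpha) / mu ** 2

alpha = 0.1  # assumed dispersion, for illustration only

for mu in (5, 50, 500, 5000):
    raw_var = nb_variance(mu, alpha)
    log_var = approx_var_of_log(mu, alpha)
    print(f"mean={mu:>5}  raw variance={raw_var:>12.1f}  approx. variance of log={log_var:.4f}")
```

Under these assumptions the raw variance grows steeply with the mean, while the variance after a plain log flattens out at high means but is still inflated at low means. A variance stabilizing transformation aims to remove that remaining trend as well.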

Another video that explains why size factors such as those from DESeq2 are superior can be found here:

ADD REPLY • link 5.1 years ago by ATpoint 85k

Hmm, ATpoint, I tend to disagree that "high variance typically comes from low counts". In RNA-seq experiments, genes with larger average expression have larger variances. For example, see Figure 1a of Simon Anders and Wolfgang Huber's paper in Genome Biology, 2010.

In Poisson-distributed data, the mean equals the variance. Therefore, the higher the mean, the higher the variance. (In negative binomially distributed data, like RNA-seq, it's even worse because as the mean gets higher, the variance tends to grow even faster, i.e. overdispersion.)

I do agree that smaller counts are more unreliable. Consider the effects of Poisson noise (shot noise). The standard deviation (noise) equals the square root of the mean (so yes, the standard deviation is higher for higher means), yet shot noise tends to have a bigger effect on lowly expressed genes. That's because it's all relative. If the mean is 1, numbers like 0, 1, and 2 are all pretty different. If the mean is 10000, then numbers like 10017, 10001, and 9982 don't really make much of a difference.
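A quick way to see the "it's all relative" point, assuming pure Poisson noise (real RNA-seq counts are overdispersed, so this is a lower bound on the noise): the standard deviation grows like the square root of the mean, but the standard deviation relative to the mean shrinks. The helper names below are hypothetical.

```python
import math

def poisson_sd(mean):
    """Standard deviation of a Poisson variable: the square root of its mean."""
    return math.sqrt(mean)

def poisson_cv(mean):
    """Coefficient of variation (sd / mean), which equals 1 / sqrt(mean) for Poisson."""
    return poisson_sd(mean) / mean

for mean in (1, 100, 10_000):
    print(f"mean={mean:>6}  sd={poisson_sd(mean):8.2f}  sd/mean={poisson_cv(mean):.4f}")
```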

If there's something I'm misunderstanding, please let me know.

ADD REPLY • link 5.1 years ago by dsull ★ 6.9k

You are right, sorry, I was mixing up variance with fold changes. Edited my comment accordingly.

ADD REPLY • link 5.1 years ago by ATpoint 85k

Hi, great answer! Very informative!

I watched the video you shared very carefully, and since I am a bit new to this field, could you kindly check whether I have understood it correctly in the following statement, and maybe answer my question?

My understanding is that the reason not to log FPKM lies in the defects of FPKM itself rather than in the log transformation: FPKM can neither solve the problem of high variance at high means nor handle isoform-level problems.
How do we know that the high variance observed in highly expressed genes is due to technical artifacts rather than biological reasons?

Thank you.

ADD REPLY • link 9 weeks ago by Charly • 0