I ran univariate Cox analysis on 100+ genes with the survival package, then adjusted the p-values with p.adjust and filtered by p.adj < 0.05.
First I used the original FPKM-UQ values from GDC. I got a few genes, but all the HR values were very close to 1 (1.000000xxxx). Then I used log2(FPKM + 1) and got more genes, and the HR values looked normal (clearly different from 1).
This suggests I should use log2 FPKM values. But I can't figure out why the original FPKM values make the HR so close to 1.
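One thing that may be worth checking here is the unit of the covariate. In a Cox model the HR is reported per one-unit increase of the covariate, and log(HR) is linear in it, so a one-unit step on a scale that spans thousands of units gives an HR very close to 1. The sketch below only illustrates that arithmetic; the helper `per_unit_hr` and all numbers are made up for illustration and are not taken from the post. The log transform also changes the shape of the relationship, so this shows the unit effect only, not a full explanation.

```python
import math


def per_unit_hr(hr_over_span: float, span: float) -> float:
    """Hazard ratio per one-unit increase, given the HR over `span` units.

    In a Cox model log(HR) is linear in the covariate, so
    HR_per_unit = HR_over_span ** (1 / span).
    """
    return math.exp(math.log(hr_over_span) / span)


# Assumed, illustrative numbers (hypothetical, not from the post):
# the hazard doubles across a 10,000-unit difference in raw FPKM-UQ.
raw_hr = per_unit_hr(2.0, 10_000)

# The same doubling spread across 4 units on a log2 scale.
log_hr = per_unit_hr(2.0, 4)

print(f"per-unit HR on the raw scale:  {raw_hr:.10f}")
print(f"per-unit HR on the log2 scale: {log_hr:.4f}")
```

Under these assumed numbers the raw-scale HR differs from 1 only in the fifth decimal place, while the log2-scale HR is about 1.19, even though both describe the same doubling of hazard.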
I think your definition of heteroscedasticity is off. Based on a typical "mean-variance plot" from an RNA-seq experiment, you would see that the higher the mean counts, the lower the variance.
High variance typically comes from low counts; here is an explanation why:
Edit: Sorry, I was mixing up high variance with artificially high fold changes, which is what the post below refers to:
A: Volcano plot: why is there big FC with big p-values?
vst does three things: 1) it normalizes counts based on the RLE strategy from DESeq2, 2) it transforms them to a log2-like scale, and 3) it tries to remove the dependency of the variance on the mean (which is essentially high variance based on small counts).
Original text from the source function:
Another video that explains why, for example, the DESeq2 size factors are superior can be found here:
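For readers who want to see the normalization step from the vst description concretely, here is a minimal pure-Python sketch of median-of-ratios (RLE-style) size factors. It is not the DESeq2 source code: the function name `size_factors`, the toy matrix, and the final `log2(x + 1)` step are all assumptions made for illustration, and the `log2(x + 1)` line is only a simple stand-in for a log2-like scale, not the actual variance-stabilizing transformation.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    `counts` is a list of rows (genes), each a list of per-sample counts.
    """
    n_samples = len(counts[0])
    # Keep only genes with no zero count so every log is finite.
    log_rows = [
        [math.log(c) for c in row]
        for row in counts
        if all(c > 0 for c in row)
    ]
    # Log of each gene's geometric mean across samples.
    log_geo_means = [sum(row) / n_samples for row in log_rows]
    factors = []
    for j in range(n_samples):
        # Log ratio of sample j to the per-gene geometric mean.
        log_ratios = [row[j] - g for row, g in zip(log_rows, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy matrix (hypothetical): sample 2 looks like sample 1 sequenced twice as deep.
counts = [[10, 20], [100, 200], [5, 10], [0, 3]]

sf = size_factors(counts)
normalized = [[c / f for c, f in zip(row, sf)] for row in counts]

# Simple stand-in for a log2-like scale; NOT the actual vst.
log_like = [[math.log2(x + 1) for x in row] for row in normalized]

print(sf)
```

In this toy case the second sample's size factor is twice the first's, so after dividing by the size factors the first three genes have equal normalized values in both samples.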
Hmm, ATpoint, I tend to disagree that "high variance typically comes from low counts". In RNA-seq experiments, genes with larger average expression have larger variances. For example, see Figure 1a of Simon Anders and Wolfgang Huber's paper in Genome Biology, 2010.
In Poisson-distributed data, the mean equals the variance. Therefore, the higher the mean, the higher the variance. (In negative binomial data, like RNA-seq, it's even worse: as the mean gets higher, the variance tends to grow even faster, i.e. overdispersion.)
I do agree that smaller counts are more unreliable. Consider the effects of Poisson noise (shot noise). The standard deviation (noise) equals the square root of the mean (so yes, the standard deviation is higher for higher means), yet shot noise tends to have a bigger effect for lowly expressed genes. That's because it's all relative. If the mean is 1, numbers like 0, 1, and 2 are all pretty different. If the mean is 10000, then numbers like 10017, 10001, and 9982 don't really make much of a difference.
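The arithmetic behind this can be sketched in a few lines. The function names are hypothetical, the negative binomial variance uses the common parameterization variance = mean + dispersion * mean^2, and the dispersion value of 0.1 is an assumed number chosen only for illustration.

```python
import math


def poisson_variance(mean: float) -> float:
    """For a Poisson distribution the variance equals the mean."""
    return mean


def nb_variance(mean: float, dispersion: float) -> float:
    """Negative binomial variance: mean + dispersion * mean^2."""
    return mean + dispersion * mean ** 2


def poisson_cv(mean: float) -> float:
    """Coefficient of variation (sd / mean) of Poisson noise: 1 / sqrt(mean)."""
    return math.sqrt(poisson_variance(mean)) / mean


DISPERSION = 0.1  # assumed value, for illustration only

for mean in (1, 10_000):
    print(
        f"mean={mean}: "
        f"Poisson var={poisson_variance(mean)}, "
        f"NB var={nb_variance(mean, DISPERSION)}, "
        f"Poisson sd/mean={poisson_cv(mean)}"
    )
```

The absolute variance grows with the mean in both models, while the relative Poisson noise shrinks from 100% of the mean at a mean of 1 to 1% at a mean of 10000.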
If there's something I'm misunderstanding, please let me know.
You are right, sorry, I was mixing up variance with fold changes. I edited my comment accordingly.
Hi, great answer! Very informative!
I watched the video you shared very carefully, and since I am a bit new to this field, could you kindly check whether I got it right in the following statement, and maybe answer my question?
Thank you.