10.6 years ago
Luyi Tian
I am analyzing an RNA-seq data set. It has been processed to remove technical variation, so unlike raw RPKM values, which start from 0, it contains some negative values. I am reluctant to add a constant to make every value positive just so that I can apply a log transformation. Is there any other way to transform this highly skewed distribution into a standard normal distribution? I am a newbie in statistics.
Thanks in advance
Do you still have the raw data? It sounds like the processing was simply done incorrectly (or it's already on a log scale).
the data is from a recent Nature article: http://www.nature.com/nature/journal/v501/n7468/full/nature12531.html
And they used a method called PEER to detect batch effects and experimental confounders: https://www.sanger.ac.uk/resources/software/peer/
No, the processed data is heavily skewed and should not be on a log scale.
Now I am reading the article about PEER and trying to understand why it generates such data.
Interesting, I'll have to read about how PEER differs from SVA/combat.
The PEER method is very similar to SVA. It tries to identify 'hidden' confounders and regress them out of your expression values. The only difference is that it uses a Bayesian approach to identify these hidden confounders. The resulting residuals will have both positive and negative numbers and represent relative expression. It's important to understand that these values are quite different from raw read counts. They only contain information on RELATIVE expression WITHIN a gene between samples. Don't add constants; just use these values directly in your statistical analysis if you are looking for differential expression between treatment groups.
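To illustrate why regressing out confounders leaves values on both sides of zero, here is a minimal sketch. It is not PEER itself: it uses ordinary least squares on a single known covariate, whereas PEER infers the hidden factors with a Bayesian model. The function name `regress_out`, the simulated `batch` factor, and all the numbers are assumptions made up for this illustration.

```python
import numpy as np

def regress_out(expr, covariates):
    """Residuals of expr (samples x genes) after a least-squares fit
    on an intercept plus covariates (samples x k)."""
    X = np.column_stack([np.ones(expr.shape[0]), covariates])
    beta, *_ = np.linalg.lstsq(X, expr, rcond=None)
    return expr - X @ beta

rng = np.random.default_rng(0)
n_samples, n_genes = 50, 4

# Hypothetical confounder: a stand-in for one inferred hidden factor.
batch = rng.normal(size=(n_samples, 1))

# Simulated expression: baseline + confounder effect + noise.
expr = 5.0 + 2.0 * batch + rng.normal(size=(n_samples, n_genes))

residuals = regress_out(expr, batch)

# Because the fit includes an intercept, each gene's residuals are
# centred on zero, so roughly half of them are negative.
print(residuals.mean(axis=0))
print((residuals < 0).mean())
```

The residuals for each gene are centred on zero by construction, which is why they only carry information about relative expression between samples and why adding a constant to make them positive has no meaningful interpretation.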
How was the RPKM computed?
It is not correct to force a distribution to be normal. If you plot the expression values of different genes, you usually get a power-law-like distribution rather than a normal distribution.
Not all random variables are normally distributed. Statistically speaking, the suitably centred and scaled sum of n independent and identically distributed (IID) random variables with finite variance converges to a normal distribution as n goes to infinity (that is the central limit theorem (CLT)).
Different genes are not IID random variables. They are different variables of a multivariate function and moreover they are not independent; if they were independent it would mean that there is no gene regulation.
Note: a random variable is not really a variable, nor is it random. An RV is a function.
If you measure the expression of gene-x 100 times, then the mean of these measurements will be approximately normally distributed, and this is in accordance with the CLT.
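The CLT statement can be checked with a small simulation. This is only a sketch under assumed settings: an exponential distribution stands in for a skewed measurement, and the sample sizes are arbitrary. The helper `skewness` is written here for the example; it is not a library function.

```python
import numpy as np

rng = np.random.default_rng(1)

def skewness(x):
    """Sample skewness: mean of the cubed standardised values."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

# Individual draws from a skewed distribution
# (the theoretical skewness of an exponential is 2).
raw = rng.exponential(scale=1.0, size=100_000)

# Means of 100 IID draws each, repeated 20,000 times.
means = rng.exponential(scale=1.0, size=(20_000, 100)).mean(axis=1)

skew_raw = skewness(raw)
skew_means = skewness(means)

# The means are far closer to symmetric (skewness near 0) than the raw draws.
print(skew_raw, skew_means)
```

Note that this only works because the draws are IID. Expression values of different genes are neither independent nor identically distributed, so pooling them gives no such guarantee.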