I have a datafile with 4 columns that looks like this:
fid iid phen sig
0002 0002 -.268465 0
0005 0005 -.033474 0
0081 0081 .2921848 0
0091 0091 1.836548 1
0094 0094 .9888859 1
0095 0095 -.1503887 0
The values in the 'phen' column have a leptokurtic distribution. I want to perform quantile normalization to give them a normal distribution.
I read the data into R using data <- read.table('phenfile.txt'). The 'cape' package (norm.pheno function) gives back an error message ("dim(X) must have a positive length"), and the 'preprocessCore' package (using normalize.quantiles(as.matrix(data[, "phen", drop = FALSE]))) failed to normalize the distribution.
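For reference, the complete attempt looks roughly like this (I am assuming header = TRUE is needed, since the file shown above has a header row):

# read the file; header = TRUE because the first line holds the column names
data <- read.table('phenfile.txt', header = TRUE)

# attempted quantile normalization of the single 'phen' column
library(preprocessCore)
normed <- normalize.quantiles(as.matrix(data[, "phen", drop = FALSE]))
hist(normed)   # distribution looks the same as before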
Are there any other packages/functions in R that could be used to normalize a single column of values in a data frame?
Any input would be greatly appreciated.
I plotted the values in the 'phen' column in a histogram to see the distribution - skewness was okay but it was leptokurtic. I'm using the values in 'phen' in a linear regression. The resulting p-values from the regression are not normally distributed (I plotted them on a -log(p)-transformed QQ plot). After looking at the post you linked to, I'm still not sure what the best way to normalize the distribution of the 'phen' column would be.
p-values are not expected to be normally distributed; under a true null hypothesis they should be uniformly distributed, e.g. 1% of them with p < 0.01, 5% with p < 0.05, and so on for any threshold. The uniform distribution is the only distribution satisfying this (see https://stats.stackexchange.com/questions/10613/why-are-p-values-uniformly-distributed-under-the-null-hypothesis).
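A quick way to see this for yourself (a toy simulation, not your data): generate data with no real effect, run the test many times, and look at the histogram of p-values.

# simulate p-values under a true null: test whether the mean of N(0,1) data is 0
set.seed(1)
p <- replicate(10000, t.test(rnorm(20))$p.value)
hist(p, breaks = 20)   # roughly flat: uniform on [0, 1]
mean(p < 0.05)         # close to 0.05, as expected under the null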
You need to make a QQ plot of the data (phen), not the p-values.
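Something along these lines (assuming the data frame is still called data, as in the question):

# QQ plot of the raw phenotype values against a normal distribution
qqnorm(data$phen)
qqline(data$phen)   # heavy tails (leptokurtosis) show up as points bending away from the line at both ends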
The answer on Cross Validated states that a closed-form transformation that makes a non-normal variable normal may not always exist. The second answer points to possible transformations. If you want to get into the statistics behind this, you should rather ask this question on Cross Validated.
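That said, for a single column a rank-based inverse normal transform is one commonly used option (my suggestion here, not necessarily what the linked answers recommend); a minimal sketch:

# rank-based inverse normal transform: map ranks to normal quantiles
# (one common offset of 0.5; other offsets such as Blom's exist;
#  assumes no missing values, ties get average ranks by default)
int <- function(x) qnorm((rank(x) - 0.5) / length(x))
data$phen_norm <- int(data$phen)
hist(data$phen_norm)   # should now look approximately normal

Because the values are mapped directly onto normal quantiles, the heavy tails disappear by construction; whether that is appropriate for your regression is the kind of question better asked on Cross Validated.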