I have a datafile with 4 columns that looks like this:
fid iid phen sig
0002 0002 -.268465 0
0005 0005 -.033474 0
0081 0081 .2921848 0
0091 0091 1.836548 1
0094 0094 .9888859 1
0095 0095 -.1503887 0
The values in the 'phen' column have a leptokurtic distribution. I want to perform quantile normalization to give them a normal distribution.
I read the data into R using data <- read.table('phenfile.txt'). The 'cape' package (norm.pheno function) gives back an error message ("dim(X) must have a positive length"), and the 'preprocessCore' package (using normalize.quantiles(as.matrix(data[, "phen", drop = FALSE]))) failed to normalize the distribution.
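For reference, the complete attempt looks roughly like this (I am assuming header = TRUE is needed, since the file shown above has a header row):

# read the file; header = TRUE because the first line holds the column names
data <- read.table('phenfile.txt', header = TRUE)

# attempted quantile normalization of the single 'phen' column
library(preprocessCore)
normed <- normalize.quantiles(as.matrix(data[, "phen", drop = FALSE]))
hist(normed)   # distribution looks the same as before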
Are there any other packages/functions in R that could be used to normalize a single column of values in a data frame?
Any input would be greatly appreciated.
I plotted the values in the 'phen' column in a histogram to see the distribution - skewness was okay but it was leptokurtic. I'm using the values in 'phen' in a linear regression. The resulting p-values from the regression are not normally distributed (I plotted them on a -log(p)-transformed QQ plot). After looking at the post you linked to, I'm still not sure what the best way to normalize the distribution of the 'phen' column would be.
p-values are not expected to be normally distributed; under a true null hypothesis they should be uniformly distributed, e.g. 1% of them with p < 0.01, 5% with p < 0.05, and so on for any threshold. The uniform distribution is the only distribution satisfying this (see https://stats.stackexchange.com/questions/10613/why-are-p-values-uniformly-distributed-under-the-null-hypothesis).
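A quick way to see this for yourself (a toy simulation, not your data): generate data with no real effect, run the test many times, and look at the histogram of p-values.

# simulate p-values under a true null: test whether the mean of N(0,1) data is 0
set.seed(1)
p <- replicate(10000, t.test(rnorm(20))$p.value)
hist(p, breaks = 20)   # roughly flat: uniform on [0, 1]
mean(p < 0.05)         # close to 0.05, as expected under the null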
You need to make a QQ plot of the data (phen), not the p-values.
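Something along these lines (assuming the data frame is still called data, as in the question):

# QQ plot of the raw phenotype values against a normal distribution
qqnorm(data$phen)
qqline(data$phen)   # heavy tails (leptokurtosis) show up as points bending away from the line at both ends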
The answer on Cross Validated states that a closed-form transformation that makes a non-normal variable normal may not always exist. The second answer points to possible transformations. If you want to get into the statistics behind this, you should rather ask this question on Cross Validated.
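That said, for a single column a rank-based inverse normal transform is one commonly used option (my suggestion here, not necessarily what the linked answers recommend); a minimal sketch:

# rank-based inverse normal transform: map ranks to normal quantiles
# (one common offset of 0.5; other offsets such as Blom's exist;
#  assumes no missing values, ties get average ranks by default)
int <- function(x) qnorm((rank(x) - 0.5) / length(x))
data$phen_norm <- int(data$phen)
hist(data$phen_norm)   # should now look approximately normal

Because the values are mapped directly onto normal quantiles, the heavy tails disappear by construction; whether that is appropriate for your regression is the kind of question better asked on Cross Validated.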