Hi all
I have a data frame with various SNPs and other columns with some information and values such as the following:
SNP ID      location  chromosome  values column A  values column B
rs8662689   78654     1           0.6432           0.2458
rs753279    1009753   7           -1.6434          1.9876
rs4331780   2086433   22          4.521            -3.743
and so on...
I would like to standardise the values in column B. As I understand it, I have to divide column B into frequency bins (I have 20x10^6 rows, so I guess I need to set a bin value by which to divide, for instance 1000). Then I would like to calculate the mean and the standard deviation of each bin, standardise each value in column B using the mean and standard deviation of its own bin (subtract the bin mean, then divide by the bin standard deviation), and store the result in a new column. Does anybody know how to do this in R?
I have written something like this, but it does not seem to work and might be wrong:
library(dplyr)
n_bins = 1000
outscore = df %>% mutate(bin=ntile(mean(df$valuesB),n_bins)) %>%
group_by(bin) %>% mutate(zscore=scale(mean()),outlier=abs(zscore)>1.7)
Any help highly appreciated. Thanks
For z-score calculation, R has the scale() function.
Yes, I know, but if I apply scale() directly to the column, it calculates the z-scores across all values of column B at once; it does not calculate them separately within each frequency bin.
Not clear what you want to do; maybe provide example input data and expected output? Also, this
ntile(mean(df$valuesB),n_bins)
creates only 1 bin, and was this
scale(mean())
meant to be
scale(mean(valuesB))
?
Apologies for the not so clear explanation. Here is a sample of my data, first 20 rows:
I would like to standardise the values of the XPEHH column. From what I understand, I need to divide the values into bins and calculate the z-scores separately within each bin. So in this example, if I use bins of 10 SNPs, I calculate the mean and standard deviation of the first 10 values and then calculate the z-score of those 10 SNPs from that mean and SD. The same is then done for the next 10 values/SNPs, and so on. Hope that is a bit clearer. Yes, I meant scale(mean(valuesB)). If I only do df$zscores<-scale(df$XPEHH), it calculates the z-scores globally; I want to calculate the z-scores within the separate bins.
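The procedure described above (bins of 10 consecutive SNPs, z-score computed within each bin) could be sketched roughly as follows. This is only an illustration under assumptions: the toy data frame, the seed, and the names bin_size, bin and zscore are made up here; only the XPEHH column name comes from the post.

```r
library(dplyr)

# Toy data standing in for the real table (values are random, for illustration only)
set.seed(1)
df <- data.frame(
  SNP   = paste0("rs", 1:20),
  XPEHH = rnorm(20)
)

bin_size <- 10  # assumed number of consecutive SNPs per bin

out <- df %>%
  # rows 1-10 go to bin 1, rows 11-20 to bin 2, and so on
  mutate(bin = ceiling(row_number() / bin_size)) %>%
  group_by(bin) %>%
  # z-score within each bin: (value - bin mean) / bin SD
  mutate(zscore = (XPEHH - mean(XPEHH)) / sd(XPEHH)) %>%
  ungroup()
```

Binning by row position assumes the rows are already in the order that should define the bins; if the bins should instead be defined by some other variable, the bin column would need to be built from that variable.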
Thank you very much. I think this is most likely what I need, because the bin column groups SNPs with similar values and the z-score is then calculated within each bin. Thanks a lot.
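The answer being thanked is not shown in this thread, so the following is only a guess at the kind of code the comment describes: ntile() applied to the values themselves (rather than to their mean) so that SNPs with similar values share a bin, followed by scale() within each bin. The toy data, the seed, and n_bins = 10 are assumptions for illustration; the 1.7 outlier threshold is taken from the code in the question.

```r
library(dplyr)

# Toy data standing in for the real table (values are random, for illustration only)
set.seed(42)
df <- data.frame(
  SNP   = paste0("rs", 1:1000),
  XPEHH = rnorm(1000)
)

n_bins <- 10  # assumed number of bins; the question used 1000 for ~20 million rows

outscore <- df %>%
  # ntile() ranks the values and splits them into n_bins roughly equal-sized groups
  mutate(bin = ntile(XPEHH, n_bins)) %>%
  group_by(bin) %>%
  # scale() returns a one-column matrix, so flatten it to a plain numeric vector
  mutate(zscore  = as.numeric(scale(XPEHH)),
         outlier = abs(zscore) > 1.7) %>%
  ungroup()
```

Passing the column itself to ntile() is what produces more than one bin; passing mean(XPEHH) gives a single value and therefore a single bin, as pointed out earlier in the thread.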
Please use the "add comment" link to add comments; at the moment you are posting into the "Add your answer" box, which is only for answers. If the answer worked for you, please consider accepting/upvoting it, so we can mark your question as "resolved".