Hi all,
I have targeted metabolomics data that have been normalized using the loess method. I want to ask whether further pre-processing is needed, namely standardization (dividing by the standard deviation of each metabolite) or log transformation of each metabolite, because I have very large values. The distributions of some metabolites look slightly skewed.
I want to run a network analysis using WGCNA (weighted gene co-expression network analysis), which is based on computing pairwise correlations. I'm therefore wondering whether it is important to standardize or log-transform the data, or to apply another pre-processing approach, before starting the analysis of the metabolites.
Thanks in advance for your help.
Best, Jane
Hi Kevin,
Thanks for an excellent summary!
I have a question related to this topic: do you think both normalization and log transformation are required, or is only one of them enough?
Also, if both are needed, which should come first, normalization or log transformation? Does the order matter?
My plan is to compare the means of two groups, and I want to remove inter- and intra-batch variation in order to get at the true biological variation.
One last thing: do you believe raw p-values are informative, or is multiple testing correction essential? If so, which method is your favourite, BH or Bonferroni correction?
Thanks in advance! I hope you see my comment :D
Cheers, Nilay
Hi Nilay, we followed the 4-step filter procedure above (I just modified my post), and then always log-transformed (natural log) and subsequently Z-transformed the data, prior to any downstream analysis. If you are dealing with multiple batches, this would be the best procedure, but you should additionally check for batch effects via PCA. If there are batch effects, you can remove them via `limma::removeBatchEffect()`, preferably at the log-transformed level, prior to Z-scaling.
I would choose BH, as Bonferroni is too harsh.
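To make the order of operations concrete, here is a minimal R sketch; the mat, batch, and group objects are made-up stand-ins for your own data, not something from our pipeline:

```r
library(limma)

# Hypothetical data: 50 metabolites (rows) x 12 samples (columns), 2 batches
set.seed(1)
mat   <- matrix(rlnorm(50 * 12, meanlog = 5), nrow = 50,
                dimnames = list(paste0("met", 1:50), paste0("s", 1:12)))
batch <- factor(rep(c("A", "B"), each = 6))
group <- factor(rep(c("ctrl", "case"), times = 6))  # the two groups to compare

mat_log <- log(mat)  # natural log transform first

# Check for batch effects via PCA on the log-transformed data
pca <- prcomp(t(mat_log), scale. = TRUE)
plot(pca$x[, 1:2], col = batch, pch = 19)

# Remove batch effects at the log level (protecting the group effect),
# then Z-scale each metabolite
mat_corrected <- removeBatchEffect(mat_log, batch = batch,
                                   design = model.matrix(~ group))
mat_z <- t(scale(t(mat_corrected)))  # Z-transform per metabolite (row)

# Per-metabolite tests, then BH adjustment of the raw p-values
pvals <- apply(mat_z, 1, function(x) t.test(x ~ group)$p.value)
padj  <- p.adjust(pvals, method = "BH")
```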
Thaaaanks a bunch!
Btw, since log2 transformation can generate negative values (some of my concentrations are quite low), I was planning to do a log2(x + 1) conversion instead; does that sound good too?
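Just to illustrate what I mean, on some made-up low concentrations:

```r
conc <- c(0.05, 0.5, 5, 50)  # hypothetical low concentrations
log2(conc)      # values below 1 come out negative
log2(conc + 1)  # the offset keeps everything non-negative
```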
About the steps above:
1-) CoV check: I got different results with multiple packages and manual calculations; which package do you use to calculate the RSD and filter metabolites? I got NAs for some metabolites, so I assume the mean and SD calculations should be done excluding NAs? (See the sketch after this list for what I mean.)
2-) So you calculate the IQR to remove non-informative variables, in other words variables that show low variability?
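Here is roughly what I mean in R; mat is a made-up metabolites-by-samples matrix, and both cut-offs are placeholders rather than recommendations:

```r
# mat: hypothetical metabolites (rows) x samples (columns) matrix with some NAs
set.seed(1)
mat <- matrix(rlnorm(50 * 12, meanlog = 5), nrow = 50)
mat[sample(length(mat), 20)] <- NA

# 1) RSD (CoV) per metabolite, computed excluding NAs
rsd <- apply(mat, 1, function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE))

# 2) IQR per metabolite, also excluding NAs
iqrs <- apply(mat, 1, function(x) IQR(x, na.rm = TRUE))

# Keep metabolites with RSD <= 30% and drop the lowest-IQR quartile
# (both thresholds are illustrative only)
keep <- rsd <= 0.30 & iqrs > quantile(iqrs, 0.25)
mat_filtered <- mat[keep, ]
```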
I appreciate your help, many thanks again! :)
PS: I'll use a 30% cut-off for metabolite and sample filtering; does that sound terrible or reasonable?
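In code, the missingness filter I have in mind would be roughly this (again on the made-up mat, with 0.30 as the placeholder cut-off):

```r
# Drop metabolites (rows) with > 30% missing values, then samples (columns)
mat2 <- mat[rowMeans(is.na(mat)) <= 0.30, ]
mat2 <- mat2[, colMeans(is.na(mat2)) <= 0.30]
```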
Hi Kevin,
I looked through the methods in your paper here: https://www.sciencedirect.com/science/article/pii/S0012369218308924 and noticed that you imputed missing data with the median peak intensity for that feature. Would you recommend imputing this way instead of using a small value, e.g. half the lowest value in the dataset? If so, why, and do you have any papers where this method was used? Thank you!
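For concreteness, here is a minimal R sketch of the two strategies I'm asking about, assuming a hypothetical metabolites-by-samples matrix mat with NAs marking missing values:

```r
# Median imputation: replace each feature's NAs with that feature's own median
impute_median <- function(m) {
  t(apply(m, 1, function(x) { x[is.na(x)] <- median(x, na.rm = TRUE); x }))
}

# Half-minimum imputation: replace every NA with half the smallest observed value
impute_half_min <- function(m) {
  m[is.na(m)] <- min(m, na.rm = TRUE) / 2
  m
}
```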