Metabolomics log-transform or standardize for WGCNA
jane.toka ▴ 20 · 8.7 years ago

Hi all,

I have targeted metabolomics data that have been normalized using the loess method. I would like to ask whether further pre-processing is needed, namely standardization (dividing each metabolite by its standard deviation) or log transformation, because I have very large values. The distributions of some metabolites look slightly skewed.

I want to run a network analysis with WGCNA (weighted gene co-expression network analysis), which is based on computing pairwise correlations. I am therefore wondering whether it is important to standardize or log-transform the data, or to apply another pre-processing approach, before starting the analysis of the metabolites.

Thanks in advance for your help.

Best, Jane

Tags: network analysis, WGCNA, metabolites
Kevin · 7.2 years ago

Hi Jane,

Apologies that no-one had answered. I have just been working on metabolomics and network analysis during my postdoc in Boston.

You may want to take a look at my recent answer here: The RNA-Seq data input for WGCNA in terms of gene co-expression network construction

Your post popped up on the right as a 'similar post'.

From my experience with metabolomics specifically, note the following processing steps, starting from the raw metabolite levels (a rough R sketch of these filters follows the list):

1) Remove metabolites if:

  • Level in QC samples has coefficient of variation (CoV) > 25%
  • Missingness is > 10% across test samples
  • No variability across test samples based on interquartile range (IQR)

2) Remove samples if:

  • Metabolite missingness > 10%

3) Filter out unidentified/unknown metabolites and those classified as xenobiotic chemicals

4) Impute / Convert NA values to 0

[You can also impute NAs with half the lowest value in the dataset]
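
As a rough illustration only, steps 1, 2, and 4 might look like this in base R; `mat` and `qc` are hypothetical names for your metabolites-by-samples matrix and the QC sample column indices, and step 3 is omitted because it needs metabolite annotations:

    # Step 1: per-metabolite QC CoV, missingness, and IQR
    cov_qc   <- apply(mat[, qc, drop = FALSE], 1,
                      function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE))
    miss_met <- rowMeans(is.na(mat[, -qc, drop = FALSE]))
    iqr_met  <- apply(mat[, -qc, drop = FALSE], 1, IQR, na.rm = TRUE)
    mat <- mat[which(cov_qc <= 0.25 & miss_met <= 0.10 & iqr_met > 0), , drop = FALSE]

    # Step 2: drop samples with > 10% metabolite missingness
    mat <- mat[, colMeans(is.na(mat)) <= 0.10, drop = FALSE]

    # Step 4: convert remaining NAs to 0
    # (alternative: mat[is.na(mat)] <- min(mat, na.rm = TRUE) / 2)
    mat[is.na(mat)] <- 0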

After that, you could log the data or convert it to the Z scale. WGCNA will accept unlogged data, too. At the end of the day, WGCNA is based on correlation.
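
For example, a minimal sketch, assuming `mat` is the filtered matrix from above (WGCNA expects samples in rows and variables in columns):

    library(WGCNA)

    datExpr <- t(log(mat + 1))   # log-transform; the +1 offset guards against the zeros from step 4
    datExpr <- scale(datExpr)    # optional: Z-scale each metabolite

    sft <- pickSoftThreshold(datExpr)                            # suggest a soft-thresholding power
    net <- blockwiseModules(datExpr, power = sft$powerEstimate)  # build co-expression modules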

Kevin


Hi Kevin,

Thanks for an excellent summary!

I have a question related to this topic: do you think both normalization and log transformation are required, or is only one of them enough?

Also, if both are needed, which one should come first, normalization or log transformation? Does the order matter?

My plan is to compare the means of two groups, and I want to remove inter- and intra-batch variation in order to recover the true biological variation.

One last thing: do you believe raw p-values are informative, or is multiple-testing correction essential? If so, which method is your favourite, BH or Bonferroni?

Thanks in advance! I hope you see my comment :D

Cheers, Nilay


Hi Nilay, we followed the 4-step filter procedure above (I have just modified my post), and then always log-transformed (natural log) and then Z-transformed the data, prior to any downstream analysis. If you are dealing with multiple batches, this would be the best procedure, but you should additionally check for batch effects via PCA. If there are batch effects, you can remove them with limma::removeBatchEffect(), preferably on the log-transformed data, prior to Z-scaling.
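
Something along these lines (a sketch only; `mat` and `batch` are placeholder names for your metabolite matrix and batch labels):

    library(limma)

    logmat <- log(mat + 1)   # natural log of the normalized levels

    # Check for batch effects via PCA (samples as rows)
    pca <- prcomp(t(logmat), scale. = TRUE)
    plot(pca$x[, 1:2], col = as.integer(factor(batch)))

    # Remove batch effects on the log scale, then Z-scale each metabolite
    corrected <- removeBatchEffect(logmat, batch = batch)
    zmat <- t(scale(t(corrected)))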

I would choose BH, as Bonferroni is too harsh.


Thanks a bunch!

By the way, since a log2 transformation could generate some negative values (some of my concentrations are quite low), I was planning to do a log2(x + 1) conversion; does that sound good too?

About the steps above:

  1) CoV check: I got different results with multiple packages and with manual calculation; which package do you use to calculate the RSD and filter metabolites? I have NAs in some metabolites, so I assume the mean and SD should be calculated excluding NAs? (My manual attempt is sketched after these questions.)

  2) So you calculate the IQR to remove non-informative variables, in other words variables that show little variation across samples?
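
For reference, this is my manual RSD calculation (with `mat` as a placeholder for my metabolite-by-sample matrix):

    # RSD (%) per metabolite, excluding NAs from mean and SD
    rsd <- apply(mat, 1, function(x)
      100 * sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE))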

I appreciate your help, many thanks again! :)

PS: I'll use a 30% cut-off for metabolite and sample filtering; does that sound terrible or reasonable?


Hi Kevin,

I looked through the methods in your paper here: https://www.sciencedirect.com/science/article/pii/S0012369218308924 and noticed that you imputed missing data with the median peak intensity for that feature. Would you recommend imputing this way rather than using a small value, e.g. half the lowest value in the dataset? If so, why, and do you have any papers where this method was used? Thank you!
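
For concreteness, the two options I am comparing look like this (with `mat` as a placeholder feature-by-sample matrix):

    # Option A: per-feature median imputation
    med_imputed <- t(apply(mat, 1, function(x) {
      x[is.na(x)] <- median(x, na.rm = TRUE)
      x
    }))

    # Option B: half the lowest observed value in the whole dataset
    min_imputed <- mat
    min_imputed[is.na(min_imputed)] <- min(mat, na.rm = TRUE) / 2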

theobroma22 ★ 1.2k · 7.2 years ago

If you can access the mzdata files, the XCMS package on Bioconductor is very handy and thorough! It will annotate your fragments, or you could plug them into WGCNA, although I've never done it this way. In XCMS, parse out the highest peak in each peak group and use this as your representative peak for that metabolite.
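
Roughly like this (a sketch using the classic xcms interface; the file path and default parameters are placeholders, not recommendations):

    library(xcms)

    files <- list.files("mzdata", pattern = "\\.mzData$",
                        full.names = TRUE, ignore.case = TRUE)
    xs <- xcmsSet(files)   # peak detection
    xs <- group(xs)        # group peaks across samples
    xs <- retcor(xs)       # retention-time correction
    xs <- group(xs)        # re-group after correction
    xs <- fillPeaks(xs)    # fill in missing peaks

    # One value per peak group and sample; "maxo" takes the highest peak
    # intensity as the representative value for that metabolite
    peak_table <- groupval(xs, value = "maxo")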
