Hey ab123,
My initial questions back to you would be: where did you recruit the controls from, and which tissues are we referring to here? Tissues like blood serum/plasma will show wider variation than others, particularly for metabolomics.
I've been working on metabolomics for the past year in the USA and took a lot of time to look specifically at the control samples that we had over there. They exhibit very high variability (as does everything in metabolomics!), but once you normalise their metabolite levels (from m/z ratios), the profiles of even different groups of healthy controls (processed in the same way but in different batches) actually match very well when looking at natural-log counts (and after removing metabolites by the criteria that I mention below). What I'm comparing here are 2 distributions (one in red, the other blue) across 2 batches of 15 randomly selected controls:
[Figure: natural log histogram]
[Figure: natural log line plot]
The distributions then get a bit out of control if you further convert these to Z-scores.
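To give a flavour of this (a minimal sketch with simulated values, not our actual data; all names here are hypothetical), the natural-log comparison and the Z-score conversion might look like:

```python
# Minimal sketch: compare two control batches after natural-log transformation.
# 'batch1' / 'batch2' are hypothetical samples x metabolites matrices.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
batch1 = rng.lognormal(mean=10, sigma=1.0, size=(15, 200))  # 15 controls
batch2 = rng.lognormal(mean=10, sigma=1.1, size=(15, 200))  # 15 controls

log1, log2 = np.log(batch1), np.log(batch2)

# Overlaid histograms of the natural-log intensities (red vs blue, as above)
plt.hist(log1.ravel(), bins=50, alpha=0.5, color="red", label="batch 1")
plt.hist(log2.ravel(), bins=50, alpha=0.5, color="blue", label="batch 2")
plt.xlabel("ln(intensity)")
plt.legend()
plt.show()

# Converting to per-metabolite Z-scores (the step that inflates the tails)
z1 = (log1 - log1.mean(axis=0)) / log1.std(axis=0)
```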
Getting back to the main point: we did not [edit:] re-do the pre-processing / normalisation of the test sample metabolite levels based on the QC sample levels. The QC samples were used purely for identifying problematic metabolites, which we then filtered out of the main data. Specifically, we applied the following filtering criteria:
Remove metabolites if:
- Level in QC samples had coefficient of variation (CoV) > 25%
- Levels in QC samples had intraclass correlation (ICC) < 0.4
- Missingness > 10% across cases and controls
- No variability across cases and controls based on interquartile range (IQR)
Then, individual samples were removed if >10% of their metabolites had missing values.
For everything that remained, we converted NAs to half the lowest observed level, to zero, or imputed them with the median level (of each metabolite), depending on the type of downstream analysis.
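For what it's worth, those three NA-handling options can be sketched in a few lines of pandas ('df' is a hypothetical samples x metabolites DataFrame, not our actual code):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "met1": [5.0, np.nan, 7.0, 6.0],
    "met2": [np.nan, 2.0, 3.0, np.nan],
})

half_min = df.fillna(df.min() / 2)   # half the lowest observed level
zeroed   = df.fillna(0)              # straight zero
medians  = df.fillna(df.median())    # median of each metabolite
```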
After all of that, your aim should be to get the levels in your cases and controls into a normalised distribution and then conduct the differential analysis. I generally found that natural-log transformation followed by conversion to Z-scores worked well, followed by independent regression modelling predicting case/control status on a per-metabolite basis. We did not actually use XCMS.
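As a rough sketch of that last step (assuming logistic regression for the case/control outcome; the data and names here are simulated, and statsmodels is just one way to do it):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
levels = pd.DataFrame(rng.lognormal(10, 1, size=(40, 5)),
                      columns=[f"met{i}" for i in range(5)])
status = pd.Series(rng.integers(0, 2, size=40))  # 0 = control, 1 = case

# Natural log, then per-metabolite Z-scores
z = np.log(levels)
z = (z - z.mean()) / z.std()

# One independent logistic regression per metabolite
pvals = {}
for met in z.columns:
    X = sm.add_constant(z[met])
    fit = sm.Logit(status, X).fit(disp=0)
    pvals[met] = fit.pvalues[met]

print(pd.Series(pvals).sort_values())
```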
Hope that this helps!
Kevin
Hi Kevin, thank you for the extensive and informative reply!
As per your question, we are looking at organ tissues here. The QCs are pooled from the samples, and they cluster nicely in PCA, suggesting small instrumental variation.
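(For reference, that check amounts to something like the sketch below, with simulated stand-in data; tight clustering of the pooled QCs relative to the biological samples is what suggests low instrumental variation.)

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
samples = rng.lognormal(10, 1.0, size=(10, 100))  # cases + controls
qcs = rng.lognormal(10, 0.1, size=(5, 100))       # pooled QC injections

X = np.log(np.vstack([samples, qcs]))
pcs = PCA(n_components=2).fit_transform(X)
# pcs[:10] = biological samples, pcs[10:] = QCs; plot PC1 vs PC2 and check
# whether the QC points form a tight cluster.
```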
I'm still a bit confused about the actual preprocessing step. Wouldn't I want my samples to be peak-aligned according to the QCs? If I preprocess all samples together, peak alignment etc. is performed across the 3 groups and the intensities reflect that. If I do it for just the samples, I end up with different features, which makes it difficult to then filter against the QC features. I could technically subject both the QCs and the 2-group samples to separate preprocessing, but that leads to slightly different lists of metabolites. I could then try to filter the QCs (CoV > 30%) against the samples? But again, the samples would then no longer be aligned according to the features present in the QCs.
Sorry if that sounds confusing; I am confused right now, more so about what the input to each step should be, I guess.
Your metabolite-removal procedure looks sound, but how is it applied?
Hi, no problem! In your original message you mentioned re-doing the pre-processing step after the initial filtering, which is something that we didn't do.
For us, all samples (QCs, cases, controls) undergo the initial pre-processing step together for peak area identification, m/z ratio calculation, etc. (as you have done), and then we filter out the metabolites/samples that meet the filtering criteria mentioned above. We then proceed with that same data for downstream testing (minus the QC samples). There are no further pre-processing steps, and the pre-processing is not re-done.
The first 2 QC criteria:
Just calculate these using the QC samples. Any metabolites that meet these criteria should then be removed from all cases and controls.
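A minimal sketch of the CoV criterion (hypothetical DataFrames 'qc' and 'data', samples in rows and metabolites in columns; the ICC filter would be applied in the same remove-from-everything way):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
mets = [f"met{i}" for i in range(50)]
qc = pd.DataFrame(rng.lognormal(10, 0.3, size=(8, 50)), columns=mets)     # QCs
data = pd.DataFrame(rng.lognormal(10, 1.0, size=(30, 50)), columns=mets)  # cases + controls

# CoV per metabolite, computed on the QC samples only
cov = qc.std() / qc.mean()

# Remove failing metabolites from ALL cases and controls
data = data[cov[cov <= 0.25].index]
```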
The other criteria:
These are applied only to the cases and controls. The 10% cutoff is a bit meaningless in your data, as you only have 10 samples (we had hundreds).
The final one: "Then individual samples were removed if >10% of their metabolites had missingness". If any sample has >10% of its metabolites with missing values, it should be removed from the dataset.
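Putting those sample-facing criteria together (a sketch on a hypothetical samples x metabolites DataFrame 'data' containing NAs):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
data = pd.DataFrame(rng.lognormal(10, 1, size=(20, 30)))
data[data < np.exp(8.5)] = np.nan  # sprinkle in some missing values

# Drop metabolites (columns) with >10% missingness across cases and controls
data = data.loc[:, data.isna().mean() <= 0.10]

# Drop metabolites with no variability, via the interquartile range
iqr = data.quantile(0.75) - data.quantile(0.25)
data = data.loc[:, iqr > 0]

# Then drop samples (rows) with >10% of their metabolites missing
data = data.loc[data.isna().mean(axis=1) <= 0.10]
```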
Hope that this clarifies it a bit?
Edit: it is interesting that you get different results when you re-do the pre-processing, but it's also expected based on the wide variation that metabolites exhibit. A lot of processing methods in this field are liable to change.
Awesome, thank you once again for the very detailed answer. The above definitely solves it for me! I still think it may be an interesting question to see whether preprocessing with the QCs biases the samples towards the QCs...
Well, we should use the word 'solved' very lightly! With metabolomics, I think that it's open game with regard to how the data is processed. Your logic does make sense, i.e., you go back and re-perform the pre-processing step with just the cases/controls (and after they've been filtered).
"Solves" my question.