Hi all, I'm writing a function to assign elemental composition to each m/z value in a list of m/z values from mass spec (MS1) data. I want to use this function in R but I suspect this is best done in using Rcpp and C++. My current approach is simply to iterate through all possible combinations of elements (considering user-inputted minimum and maximum values for each element). My code is as follows:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerMatrix assignComposition(NumericVector mz,
int c_min, int c_max,
int h_min, int h_max,
int o_min, int o_max,
int n_min, int n_max,
int p_min, int p_max,
int na_min, int na_max,
int cl_min, int cl_max) {
double c_em = 12.0;
double h_em = 1.007825;
double o_em = 15.9949;
double n_em = 14.003074;
double p_em = 30.973762;
double na_em = 22.989770;
double cl_em = 34.968853;
int s = mz.size();
IntegerVector c_max_calc(s);
IntegerVector o_max_calc(s);
IntegerVector n_max_calc(s);
IntegerVector p_max_calc(s);
for(int i = 0; i < s; ++i){
c_max_calc[i] = floor(mz[i] / c_em);
o_max_calc[i] = floor(mz[i] / o_em);
n_max_calc[i] = floor(mz[i] / n_em);
p_max_calc[i] = floor(mz[i] / p_em);
}
IntegerVector c_ret(s);
IntegerVector h_ret(s);
IntegerVector o_ret(s);
IntegerVector n_ret(s);
IntegerVector p_ret(s);
IntegerVector na_ret(s);
IntegerVector cl_ret(s);
for(int i = 0; i < s; ++i){
double error = 1000;
int c_max_realized = std::min(c_max,c_max_calc[i]);
int o_max_realized = std::min(o_max,o_max_calc[i]);
int n_max_realized = std::min(n_max,n_max_calc[i]);
int p_max_realized = std::min(p_max,p_max_calc[i]);
for (int c = c_min; c <= c_max_realized; ++c) {
for (int h = h_min; h <= h_max; ++h) {
for(int o = o_min; o <= o_max_realized; ++o) {
for (int n = n_min; n <= n_max_realized; ++n) {
for (int p = p_min; p <= p_max_realized; ++p) {
for (int na = na_min; na <= na_max; ++na) {
for (int cl = cl_min; cl <= cl_max; ++cl) {
if(abs(mz[i] - ((c * c_em)+(h * h_em)+(o * o_em)+(n * n_em)+(p * p_em)+(na * na_em)+(cl * cl_em)) ) < error) {
error = abs(mz[i] - ((c * c_em)+(h * h_em)+(o * o_em)+(n * n_em)+(p * p_em)+(na * na_em)+(cl * cl_em)) );
c_ret[i] = c;
h_ret[i] = h;
o_ret[i] = o;
n_ret[i] = n;
p_ret[i] = p;
na_ret[i] = na;
cl_ret[i] = cl;
}
}
}
}
}
}
}
}
}
return Rcpp::cbind(c_ret,Rcpp::cbind(h_ret,Rcpp::cbind(o_ret,Rcpp::cbind(n_ret,Rcpp::cbind(p_ret,Rcpp::cbind(na_ret,cl_ret))))));
}
The code runs fine and isn't slow per-se (i can assign composition to 16,000 m/z values from within R in 12.107 seconds), but I'm wondering if anyone has any ideas for how to:
- speed my code up, maybe by using some biologically-informed "tricks" to decrease the number of iterations performed, or else some C++ wizardry i'm not privy to (I've never used C++ before but know some java so writing this script hasn't taken too terribly long)
Return the top number of elemental composition "matches" based on a user-defined variable, n (e.g. user can say, return the 1 best composition and they get 1 composition per m/z, or they can say return the 5 best compositions and they get the 5 best compositions with theoretical mass nearest the observed m/z). I'm fairly stuck for how i'd begin to do this--my only thought is an ugly one and would involve turning error into a list with
NumericVector error(n);
and then each loop comparing like
abs(mz[i] - ((c * c_em)+(h * h_em)+(o * o_em)+(n * n_em)+(p * p_em)+(na * na_em)+(cl * cl_em)) ) < error[n]
This would get messy and complicated fast and i can't help but feel there's a better way which is eluding me.
Thanks in advance!
Brian
Edit: I'm thinking it might be best to abandon the for-loop based approach as it only works for a very fixed set/# of different elements, and would get exponentially slower with the addition of more unique elements (or isotopes). I'm sure there must be open source solutions to this problem but I'm having trouble finding them! Next step might be considering the approach used in more general such problems like this one https://stackoverflow.com/questions/20193555/finding-combinations-to-the-provided-sum-value
Actually i'm thinking this is going to relate to the Knapsack problem
Thanks for following up with a resolution to this thread.
I am actually amazed that you got this much feedback for what seems to be a small-scale optimization problem. If your goal was to shave hours or days of running time, I would understand the appeal of helping you. But I don't get how anyone can invest minutes or hours of their time to help you shave off seconds in your runs. Good for you that there are people like that out there!
I agree! Though to be honest, my actual question was only ~half "optimization for speed" -- the other half was something along the lines "I know there are proprietary softwares (Thermo XCalibur, for example) that do this way better than my C++ brute force implementation ever could, does anybody else have any ideas how those softwares work?" A big reason I asked this question at all was that I figured there was some well established way to solve this problem which for some reason I couldn't find in my own google searches, so I hoped somebody would chime in to that end.
Luckily I ended up enjoying finding the more elegant and faster solution and hopefully it helps others in the future! Maybe i'll edit the title of the post to reflect that it's more about figuring out how to assign elemental composition to m/z in general and less about optimizing for loops in C++.