Help speeding up C++ algorithm to assign elemental composition to single m/z (=mass) value
2
0
Entering edit mode
19 months ago
bkleiboeker ▴ 370

Hi all, I'm writing a function to assign elemental composition to each m/z value in a list of m/z values from mass spec (MS1) data. I want to use this function in R but I suspect this is best done in using Rcpp and C++. My current approach is simply to iterate through all possible combinations of elements (considering user-inputted minimum and maximum values for each element). My code is as follows:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerMatrix assignComposition(NumericVector mz,
                                int c_min, int c_max,
                                int h_min, int h_max,
                                int o_min, int o_max,
                                int n_min, int n_max,
                                int p_min, int p_max,
                                int na_min, int na_max,
                                int cl_min, int cl_max) {
    double c_em = 12.0;
    double h_em = 1.007825;
    double o_em = 15.9949;
    double n_em = 14.003074;
    double p_em = 30.973762;
    double na_em = 22.989770;
    double cl_em = 34.968853;

    int s = mz.size();
    IntegerVector c_max_calc(s);
    IntegerVector o_max_calc(s);
    IntegerVector n_max_calc(s);
    IntegerVector p_max_calc(s);

    for(int i = 0; i < s; ++i){
        c_max_calc[i] = floor(mz[i] / c_em);
        o_max_calc[i] = floor(mz[i] / o_em);
        n_max_calc[i] = floor(mz[i] / n_em);
        p_max_calc[i] = floor(mz[i] / p_em);
    }

    IntegerVector c_ret(s);
    IntegerVector h_ret(s);
    IntegerVector o_ret(s);
    IntegerVector n_ret(s);
    IntegerVector p_ret(s);
    IntegerVector na_ret(s);
    IntegerVector cl_ret(s);

    for(int i = 0; i < s; ++i){
        double error = 1000;

        int c_max_realized = std::min(c_max,c_max_calc[i]);
        int o_max_realized = std::min(o_max,o_max_calc[i]);
        int n_max_realized = std::min(n_max,n_max_calc[i]);
        int p_max_realized = std::min(p_max,p_max_calc[i]);

        for (int c = c_min; c <= c_max_realized; ++c) {
            for (int h = h_min; h <= h_max; ++h) {
                for(int o = o_min; o <= o_max_realized; ++o) {
                    for (int n = n_min; n <= n_max_realized; ++n) {
                        for (int p = p_min; p <= p_max_realized; ++p) {
                            for (int na = na_min; na <= na_max; ++na) {
                                for (int cl = cl_min; cl <= cl_max; ++cl) {

                                    if(abs(mz[i] - ((c * c_em)+(h * h_em)+(o * o_em)+(n * n_em)+(p * p_em)+(na * na_em)+(cl * cl_em)) ) < error) {

                                        error = abs(mz[i] - ((c * c_em)+(h * h_em)+(o * o_em)+(n * n_em)+(p * p_em)+(na * na_em)+(cl * cl_em)) );

                                        c_ret[i] = c;
                                        h_ret[i] = h;
                                        o_ret[i] = o;
                                        n_ret[i] = n;
                                        p_ret[i] = p;
                                        na_ret[i] = na;
                                        cl_ret[i] = cl;


                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }

  return Rcpp::cbind(c_ret,Rcpp::cbind(h_ret,Rcpp::cbind(o_ret,Rcpp::cbind(n_ret,Rcpp::cbind(p_ret,Rcpp::cbind(na_ret,cl_ret))))));
}

The code runs fine and isn't slow per-se (i can assign composition to 16,000 m/z values from within R in 12.107 seconds), but I'm wondering if anyone has any ideas for how to:

  1. speed my code up, maybe by using some biologically-informed "tricks" to decrease the number of iterations performed, or else some C++ wizardry i'm not privy to (I've never used C++ before but know some java so writing this script hasn't taken too terribly long)
  2. Return the top number of elemental composition "matches" based on a user-defined variable, n (e.g. user can say, return the 1 best composition and they get 1 composition per m/z, or they can say return the 5 best compositions and they get the 5 best compositions with theoretical mass nearest the observed m/z). I'm fairly stuck for how i'd begin to do this--my only thought is an ugly one and would involve turning error into a list with

    NumericVector error(n);

and then each loop comparing like

abs(mz[i] - ((c * c_em)+(h * h_em)+(o * o_em)+(n * n_em)+(p * p_em)+(na * na_em)+(cl * cl_em)) ) < error[n]

This would get messy and complicated fast and i can't help but feel there's a better way which is eluding me.

Thanks in advance!

Brian

Edit: I'm thinking it might be best to abandon the for-loop based approach as it only works for a very fixed set/# of different elements, and would get exponentially slower with the addition of more unique elements (or isotopes). I'm sure there must be open source solutions to this problem but I'm having trouble finding them! Next step might be considering the approach used in more general such problems like this one https://stackoverflow.com/questions/20193555/finding-combinations-to-the-provided-sum-value

Actually i'm thinking this is going to relate to the Knapsack problem

Cpp Rcpp spectrometry mass • 1.6k views
ADD COMMENT
3
Entering edit mode
16 months ago
bkleiboeker ▴ 370

In case anyone stumbles on this question in the future:

I wasn't able to speed up the algorithm in C++, but I was dissatisfied with the inflexibility of the approach anyways (mostly, it wasn't remotely feasible to accomodate all elements and their isotopes using a nested for loop approach).

LChart 's suggestion of quadprog was a great nudge towards a more flexible and robust solution to this problem, but as best as I can tell quadratic-based approaches don't allow for constraining the variables to integer values only (which is necessary here since elements obviously only occur in integer numbers). After much reading, I came to believe this problem is best tackled as a linear programming problem (this SO post was very helpful, as OP was clearly approaching the same problem as me).

I did a lot of comparing, and in R the package lpsolve is the fastest of all linear programming packages, but as the above SO post highlights, often misses the best solutions for unclear reasons. lpSolveAPI is accurate but slow, and lpsymphony was both fast and accurate but crashed my R session any time I used it in lapply() across ~10 or more inputs. The package Rglpk (using Rglpk_solve_LP) ended up being the best combination of accuracy and speed, and you can see how I implemented it to solve this problem here.

Best of luck!

ADD COMMENT
1
Entering edit mode

Thanks for following up with a resolution to this thread.

I am actually amazed that you got this much feedback for what seems to be a small-scale optimization problem. If your goal was to shave hours or days of running time, I would understand the appeal of helping you. But I don't get how anyone can invest minutes or hours of their time to help you shave off seconds in your runs. Good for you that there are people like that out there!

ADD REPLY
0
Entering edit mode

I agree! Though to be honest, my actual question was only ~half "optimization for speed" -- the other half was something along the lines "I know there are proprietary softwares (Thermo XCalibur, for example) that do this way better than my C++ brute force implementation ever could, does anybody else have any ideas how those softwares work?" A big reason I asked this question at all was that I figured there was some well established way to solve this problem which for some reason I couldn't find in my own google searches, so I hoped somebody would chime in to that end.

Luckily I ended up enjoying finding the more elegant and faster solution and hopefully it helps others in the future! Maybe i'll edit the title of the post to reflect that it's more about figuring out how to assign elemental composition to m/z in general and less about optimizing for loops in C++.

ADD REPLY
2
Entering edit mode
19 months ago
LChart 4.6k

I'm not an expert in biochemistry; but this looks an awful lot like a least-squares program (square the error rather than take absolute value), so you should be able to use the quadprog package to solve such allocation problems, at least approximately.

ADD COMMENT
0
Entering edit mode

This looks very promising, thanks! I hadn't thought of this problem as a least-squares problem yet, but I was nervous the solution might indeed involve matrices

ADD REPLY
0
Entering edit mode

Do you have any advice for manipulating polynomials in R, and/or how to build the Quadratic objective term (H) and Linear objective term (f) matricies dynamically so that I can write a general function which can handle a variable number of variables?

Or does this seem unfeasible, in which case I could just construct these matricies for all elements in periodic table and set coefficients to zero for all elements which the user has NOT selected...

Currently, I'm trying to use the mvp package to convert the user input into a quadratic equation and then extract the coefficients and variables as needed to construct H and f, but I'm really struggling, and it seems unlikely that this will yield an elegant and robust solution.

I think you are right that this is best approached as a least squares problem, so thank you for that!

ADD REPLY
1
Entering edit mode

My general advice would be to have the function take in a list as input, and to use rbind to build up the matrices variable by variable. I think further discussion in this direction would be better suited for a more interactive chat, such as here, prior to taking the thread back up on this forum.

ADD REPLY

Login before adding your answer.

Traffic: 2636 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6