I am fitting for position specific scoring matrices (PSSM aka Position Specific Weight Matrices). The fit I'm using is like simulated annealing, where I the perturb the PSSM, compare the prediction to experiment and accept the change if it improves agreement. This means I apply the PSSM millions of times per fit; performance is critical. In my particular problem, I'm applying a PSWM for an object of length L
(~8 bp) at every position of a DNA sequence of length M
(~30 bp) (so there are M-L+1
valid positions). (Tags apparently don't permit the '+' symbol, but I'm writing in C++ not C.)
My best idea is to convert the DNA into some kind of a matrix so that applying the PSWM is matrix multiplication. There are efficient linear algebra libraries out there (e.g. BLAS), but I'm not sure how best to turn an M-length DNA sequence into a matrix M x 4
matrix and then apply the PSSM at each position. The solution needs to work for higher order/dinucleotide terms in the PSSM - presumably this means representing the sequence-matrix for mono-nucleotides and separately for dinucleotides.
My current solution iterates over each position m
, then over each letter in word from m
to m+L-1
, adding the corresponding term in the matrix. I'm storing the matrix as a multi-dimensional STL vector, and profiling has revealed that a lot of the computation time is just accessing the elements of the PSSM (with similar performance bottlenecks accessing the DNA sequence).
If someone has an idea besides matrix multiplication, I'm all ears.
Thanks, but their algorithm for applying the PSSM is basically the same as mine (though the task to which I'm applying the PSSM's is different from what MOODS is aimed at). The sequence and score matrices are both represented as STL vectors in this code, but this is where I'm starting, what I'm trying to improve on.
Well, if you're able to find a better algorithm somewhere, can you link it here? I'd be interested in it too.