I apologize in advance for the length of this posting. Also, I'm not a biologist, so I may get some biological details wrong. Corrections gratefully accepted.
Some years ago I worked on a bioinformatics project for some biologists which was abandoned. I'm now considering whether this project is salvageable. The project was about DNA motif finding. I'm writing to ask some questions in regard to this.
Based on a review of the literature, it seems motif finding is focused more (perhaps mainly) on the following problem. I certainly seemed to find more of an emphasis on this in the literature.
PROBLEM A: Given a collection of strings, find some substrings (motifs) that these sequences have in common. Since these strings are DNA sequences, these strings are from a 4 letter alphabet. Usually (it seems to me) the structure of these substrings is not known in advance.
One does not look for an exact substring match across all the strings, since there are mutations. I have not been able to find a precise description for what kind of variability is allowed for. Perhaps there is no such precise description.
However, the project that I worked on is designed to do something different, namely
PROBLEM B: Given a set of known and related (in a biological sense) motifs of fixed length n, try to pick up some common structure in these motifs (usually expressed as a probabilistic model), which one can then apply to trying to find other related motifs in a much longer sequence, e.g. a genome.
I won't go into details about the model here, but basically the supposition is that changes in a particular position may be correlated to changes in another position. I don't know if this is actually the case in practice, and it may not be easy to establish. One could perhaps look at pairwise correlations for particular data sets. Does anyone know if this has been studied at all?
I devised a genetic algorithm which searches the space of all models (of a specific type) to find the model that is the best fit to the data.
Given a model, one can then walk ones way down a long sequence, calculating the maximum likelihood of each segment, and then rank them. The hope is that the original sequences are highly ranked, and that the other highly ranked sequences are of interest.
Since there are many possible models of length n, this method doesn't work well unless there are many motifs to use as a training set, and the common length of the motifs is relatively short.
If one uses data simulated from the model, and then tries to recover the original model from the simulated data, this works well as long as the set of data is large enough. Around 300-500 rows (motifs) for lengths 20-30 work well. After 30, things blow up fast. Of course, even 300 is a lot to ask of real biological data. Also, of course, there is no way of knowing whether the real biological data actually behaves like the simulated data, so I don't think the simulated data set is useful in determining how well this method will perform on actual data.
My impression is that most motif finding is heuristic in nature, so nobody is going to care too much about the origin of the method as long as it appears to return something useful. So the question is whether this method does anything useful on actual biological data. One concern I have is that the original data will be highly ranked because of conserved regions. Perhaps one could add a "control" dummy dataset which has the same conserved regions as the real dataset but randomly generated data otherwise, and see how that compares.
So here is my main question, which is a two parter.
a) Can anyone point me to general discussions in the literature (a survey paper would be nice), of examples of what I have called PROBLEM B in a biological context? I haven't come across anything. A method is of no use without applications.
b) Where can I find benchmarks by which I can measure performance of this tool? Sufficiently large training sets of sufficiently short motifs with comparative results by other tools would be useful, for example.
Additionally, if anyone could suggest other and possibly more appropriate forums to ask this question, I'd appreciate it.
Also, of course, if you actually have any interest in the scenario sketched above and/or relevant data, I'd be glad to hear about it.
UPDATE: Edited question to clarify that the project was about DNA motif finding, as opposed to say, protein sequences. While the approach could presumably be used on protein sequences, I have not tried it. Regardless, I would be interested in related information about any kind of motif finding, not just for DNA motifs.
Thanks, Aleksandr. I'll take a look.
Thanks for the reference, Aleksandr.