Resurrecting Dna Motif Finding Project
2
5
Entering edit mode
13.1 years ago
Faheemmitha ▴ 210

I apologize in advance for the length of this posting. Also, I'm not a biologist, so I may get some biological details wrong. Corrections gratefully accepted.

Some years ago I worked on a bioinformatics project for some biologists which was abandoned. I'm now considering whether this project is salvageable. The project was about DNA motif finding. I'm writing to ask some questions in regard to this.

Based on a review of the literature, it seems motif finding is focused more (perhaps mainly) on the following problem. I certainly seemed to find more of an emphasis on this in the literature.

PROBLEM A: Given a collection of strings, find some substrings (motifs) that these sequences have in common. Since these strings are DNA sequences, these strings are from a 4 letter alphabet. Usually (it seems to me) the structure of these substrings is not known in advance.

One does not look for an exact substring match across all the strings, since there are mutations. I have not been able to find a precise description for what kind of variability is allowed for. Perhaps there is no such precise description.

However, the project that I worked on is designed to do something different, namely

PROBLEM B: Given a set of known and related (in a biological sense) motifs of fixed length n, try to pick up some common structure in these motifs (usually expressed as a probabilistic model), which one can then apply to trying to find other related motifs in a much longer sequence, e.g. a genome.

I won't go into details about the model here, but basically the supposition is that changes in a particular position may be correlated to changes in another position. I don't know if this is actually the case in practice, and it may not be easy to establish. One could perhaps look at pairwise correlations for particular data sets. Does anyone know if this has been studied at all?

I devised a genetic algorithm which searches the space of all models (of a specific type) to find the model that is the best fit to the data.

Given a model, one can then walk ones way down a long sequence, calculating the maximum likelihood of each segment, and then rank them. The hope is that the original sequences are highly ranked, and that the other highly ranked sequences are of interest.

Since there are many possible models of length n, this method doesn't work well unless there are many motifs to use as a training set, and the common length of the motifs is relatively short.

If one uses data simulated from the model, and then tries to recover the original model from the simulated data, this works well as long as the set of data is large enough. Around 300-500 rows (motifs) for lengths 20-30 work well. After 30, things blow up fast. Of course, even 300 is a lot to ask of real biological data. Also, of course, there is no way of knowing whether the real biological data actually behaves like the simulated data, so I don't think the simulated data set is useful in determining how well this method will perform on actual data.

My impression is that most motif finding is heuristic in nature, so nobody is going to care too much about the origin of the method as long as it appears to return something useful. So the question is whether this method does anything useful on actual biological data. One concern I have is that the original data will be highly ranked because of conserved regions. Perhaps one could add a "control" dummy dataset which has the same conserved regions as the real dataset but randomly generated data otherwise, and see how that compares.

So here is my main question, which is a two parter.

a) Can anyone point me to general discussions in the literature (a survey paper would be nice), of examples of what I have called PROBLEM B in a biological context? I haven't come across anything. A method is of no use without applications.

b) Where can I find benchmarks by which I can measure performance of this tool? Sufficiently large training sets of sufficiently short motifs with comparative results by other tools would be useful, for example.

Additionally, if anyone could suggest other and possibly more appropriate forums to ask this question, I'd appreciate it.

Also, of course, if you actually have any interest in the scenario sketched above and/or relevant data, I'd be glad to hear about it.

UPDATE: Edited question to clarify that the project was about DNA motif finding, as opposed to say, protein sequences. While the approach could presumably be used on protein sequences, I have not tried it. Regardless, I would be interested in related information about any kind of motif finding, not just for DNA motifs.

motif motif motif • 5.1k views
ADD COMMENT
2
Entering edit mode
13.1 years ago

My colleague worked on a project that seems to be related to the problem B that you are describing:

Horan, K., Shelton, C. R., Girke, T., 2010. Predicting conserved protein motifs with Sub-HMMs. BMC Bioinformatics 11, 205­205. URL http://www.hubmed.org/display.cgi?uids=20420695

ADD COMMENT
0
Entering edit mode

Thanks, Aleksandr. I'll take a look.

ADD REPLY
0
Entering edit mode

Thanks for the reference, Aleksandr.

ADD REPLY
2
Entering edit mode
13.1 years ago
Fabian Bull ★ 1.3k

I had a course about sequence analysis which teached your approach. It helps to understand more complex approaches.

Given a set of motifs one can train a mixture of different statistical models (for sequence analysis) such as PWMs, WAMs and higher-order Markov Models. For training of these models you have to derive an EM algorithm or a Gibbs Sampler. After training all models you calculate likelihoods of a subsequence sliding over the sequence and classify according a likelihood ratio.

However, this approach is a subproblem to de novo motif finding because most de novo motif programs use bayesian statistics to benefit from prior knowledge.

EDIT: The statement of Problem B being a subset of de novo motif finding is formally incorrect but in a practical sense you can pretty much say so. This is because you can set the parameters of gibbs sampling or EM algorithm in a way that these algorithms will only find motifs you have given to them. These topic is often reffered to as 'pseudo counts' and 'dirichlet prior'.

ADD COMMENT
0
Entering edit mode

Thanks, peri4n. Do you have a reference for the material you mention?

ADD REPLY
0
Entering edit mode

Are you searching for books or papers? In case you search papers I think there are none. Books describing the basics are Sequence Data Mining (Advances in Database Systems) by Guozhu Dong, Jian Pei or Statistical Methods in Bioinformatics: An Introduction (Statistics for Biology and Health) by Warren J. Ewens und Gregory R. Grant.

ADD REPLY
0
Entering edit mode

Hi. peri4n. Thanks very much for your message and the references. I don't have currently have access to a good library, so I was hoping for online material. Are you saying that Problem B is a subset of Problem A, because it seems to me they are differenr problems, though Problem A may be a harder and more general problem. Can you expand your answer to elaborate on this? Thanks.

ADD REPLY
0
Entering edit mode

Hi, peri4n. I found that "Statistical Methods in Bioinformatics" was available cheap locally, so I ordered it. Thanks for the recommendation.

ADD REPLY
0
Entering edit mode

I may write an article about this (as summary for myself). I'll contact you when I am done.

ADD REPLY
0
Entering edit mode

Hi peri4n, Thanks for your interesting further comment. Can you suggest a reference where the connection of Problem B and de novo motif finding (as you describe it) is discussed in a systematic way?

ADD REPLY

Login before adding your answer.

Traffic: 2701 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6