Question

What Is The Appropriate Order For A Background Model In Motif Searches?

5

11.8 years ago

mgalactus ▴ 780

Hi,

when searching for DNA motifs in upstream sequences (for instance with the MEME suite), it is suggested to add a background model to distinguish the motif from the background noise of the sequence. One possibility is to use a Markov model of order k, in which the probability of each base depends on the k preceding bases and is estimated from the frequencies of (k+1)-mers.

My question is: how should the value of k be chosen? The related literature I found states that it should be proportional to the putative motif length, but no clear rule of thumb is given. Our guess is that it shouldn't be too large, both to keep the computation manageable and to avoid overfitting.

Thanks

motif meme • 4.9k views

score 7 · Answer 1 · 2013-06-26

The authors of the MEME suite have written a general rule of thumb for choosing the appropriate order for both protein and DNA searches.

Here's a significant extract:

Typically, you should not specify an order larger than 3 for DNA sequences, or larger than 2 for protein sequences. However, if your input sequences contain higher-order non-random effects that are getting in the way of motif finding, you can follow the following "rules of thumb":

Use a background model at least four orders less than the shortest motifs you are looking for. So, if you want to find motifs as short as six, I wouldn't use a model higher than order two.

For an accurate model of order N, you need to use a FASTA file as input to fasta-get-markov with at least 10 times 4^(N+1) DNA characters in it. So,

order 3 requires 2560 characters

order 4 requires 10240 characters

order 5 requires 40960 characters etc.
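
The two rules of thumb quoted above are simple enough to check with a few lines of Python. This is only a sketch of the arithmetic, not part of the MEME suite: the function names are made up for illustration, and combining the two rules with the "no larger than 3 for DNA" default into a single suggestion is my own assumption about how one might apply them together.

```python
def min_training_chars(order: int) -> int:
    """Minimum number of DNA characters suggested for an order-N model: 10 * 4^(N+1)."""
    return 10 * 4 ** (order + 1)


def max_order_for_motif(min_motif_width: int) -> int:
    """Highest order allowed by the 'four orders less than the shortest motif' rule."""
    return max(0, min_motif_width - 4)


def suggested_order(min_motif_width: int, n_chars: int, cap: int = 3) -> int:
    """Hypothetical helper: largest order satisfying both rules and the DNA cap.

    Assumption: the cap of 3 is the 'typical' DNA limit from the quoted text;
    raise it only if higher-order effects are known to interfere.
    """
    order = min(cap, max_order_for_motif(min_motif_width))
    # Step down until the background sequences are long enough for this order.
    while order > 0 and min_training_chars(order) > n_chars:
        order -= 1
    return order


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"order {n} requires {min_training_chars(n)} characters")
    # Shortest motif of width 6, 5000 bp of background sequence available
    print(suggested_order(min_motif_width=6, n_chars=5000))
```

With a shortest motif of width 6 the first rule already limits the order to 2, and 5000 characters are more than the 640 an order-2 model would need, so the data-size rule does not lower it further.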