I'm trying to use MEME for motif discovery on reads from a high-throughput sequencing experiment. If I run MEME on the order of 400k reads, it becomes unbearably slow, and even if I drop to ~10k sequences it's still pretty slow. I am using these parameters:
-dna -text -nmotifs 30 -maxsize 100000000 -maxw 15
The large -maxsize is required to make it run on so many sequences. I restricted motif width to 15 (-maxw) and the number of motifs to discover (-nmotifs) to 30. Which of these make large differences in speed? That will help me optimize. Is it at all possible to use MEME on millions of sequences? What is the upper bound that's practical?
I would also like to try Homer, so if anyone has thoughts on Homer's speed and the parameters that particularly affect it, I would like to know. Thanks.
Concretely, the running time of MEME grows as the square of the total number of characters of the sequence and the cube of the number of sequences. This makes running MEME on more than about 10,000 sequences impractical on commodity hardware. MEME-ChIP works around this by sampling sequences from the input set and running MEME on only the sampled sequences. DREME's running time grows roughly linearly with the number of characters in the sequence data, but it's limited to motifs of width 8 or less.
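If you want to do the subsampling yourself before calling MEME, something along these lines might work. This is only a rough sketch of the general idea (draw a uniform random subset of the reads and run MEME on that), not MEME-ChIP's actual sampling code. The function names `read_fasta`, `reservoir_sample` and `write_fasta` are made up for illustration, and it assumes the reads are in plain FASTA format. It uses reservoir sampling so the full read set never has to sit in memory.

```python
import io
import random


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def reservoir_sample(records, k, seed=0):
    """Return up to k records chosen uniformly at random in a single pass."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    sample = []
    for i, record in enumerate(records):
        if i < k:
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = record
    return sample


def write_fasta(records, handle):
    """Write (header, sequence) pairs to an open file handle as FASTA."""
    for header, seq in records:
        handle.write(f">{header}\n{seq}\n")


# Tiny in-memory demo; with real data, open the reads file instead.
demo = io.StringIO(">r1\nACGT\n>r2\nGGCA\n>r3\nTTAG\n>r4\nCCGA\n")
subset = reservoir_sample(read_fasta(demo), k=2, seed=1)
out = io.StringIO()
write_fasta(subset, out)
```

With real reads you would open the input file, pick a sample size well under the ~10,000-sequence limit mentioned above (the exact number is a judgment call), write the subset to a new FASTA file, and run MEME on that file with the same parameters as in the question.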
Many thanks for the detailed explanation!