Following up on Mikael's answer: you may have good reasons for doing what you're doing. If you have access to a computational cluster, you might look into compiling `meme_p`, a variant of `meme` that incorporates OpenMPI support to spread the work across multiple nodes.
You might build it like so:
$ cd /home/foo/meme_4.9.0
$ ./configure \
--prefix=/home/foo/meme_4.9.0 \
--with-url="http://meme.nbcr.net/meme" \
--enable-openmp \
--enable-debug \
--with-mpicc=/opt/openmpi-1.6.3/bin/mpicc \
--enable-opt
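Assuming `configure` completes cleanly, the build and install follow the usual autotools flow (`make test` exercises the bundled test suite):

$ make
$ make test
$ make install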
You need to add the OpenMPI library path to your `LD_LIBRARY_PATH` environment variable in your environment setup, _e.g._ `LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/openmpi-1.6.3/lib`. Your OpenMPI installation must also be present on, or available to, each cluster node.
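In a `bash`-style setup, that might look like the following (the OpenMPI path here matches the example install above; adjust to your site):

```shell
# Add OpenMPI's runtime libraries and binaries to the environment;
# put these lines in ~/.bashrc (or your job script) so they reach each node.
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/opt/openmpi-1.6.3/lib"
export PATH="${PATH}:/opt/openmpi-1.6.3/bin"
```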
A Sun Grid Engine-based script called `runall.cluster` would fire off the search as follows, assuming your system administrator has set up a parallel environment called `mpi_pe` (for example) with at least 64 slots:
#!/bin/bash
#
# runall.cluster
#
#$ -N memeCluster64
#$ -S /bin/bash
#$ -pe mpi_pe 64
#$ -v -np=64
#$ -cwd
#$ -o "memeCluster64.out"
#$ -e "memeCluster64.err"
#$ -notify
#$ -V
time /opt/openmpi-1.6.3/bin/mpirun \
-np 64 \
/home/foo/meme_4.9.0/bin/meme_p \
/home/foo/meme_4.9.0/data/myReads.fa \
-oc /home/foo/meme_4.9.0/output/myReads.fa.meme \
-dna \
-text \
-nmotifs 30 \
-maxsize 100000000 \
-maxw 15
To run it:
$ qsub ./runall.cluster
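Once submitted, you can check on the job and watch its output with standard SGE commands (the job name matches the `-N` line in the script above):

$ qstat -u $USER
$ tail -f memeCluster64.out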
In our environment, testing showed an immediate benefit with as few as 8 or 16 nodes, with diminishing returns after about 32-64 nodes. You can use GNU `time` to do the same runtime testing on your end, measuring execution time against node count on a small test sequence set, in order to find a "sweet spot" where your job runs faster without taking up too much of the cluster.
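As a sketch, a helper function like the following (hypothetical; paths match the examples above) could run the same small test set at several MPI process counts and report the elapsed time for each, which you can then compare to pick a slot count:

```shell
# Hypothetical sweep: run meme_p on a small test FASTA at several
# MPI process counts and report wall-clock time for each run.
# Not invoked here; call it as: meme_sweep /path/to/test.fa
meme_sweep() {
  fasta="$1"
  for np in 8 16 32 64; do
    # GNU time's -f option prints a custom summary line per run;
    # %e is elapsed wall-clock time in seconds.
    /usr/bin/time -f "np=${np} elapsed=%e s" \
      /opt/openmpi-1.6.3/bin/mpirun -np "${np}" \
      /home/foo/meme_4.9.0/bin/meme_p "${fasta}" \
      -dna -text -nmotifs 5 -maxw 15 > /dev/null
  done
}
```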
Concretely, the running time of MEME grows as the square of the total number of characters in the sequence set and as the cube of the number of sequences, which makes running MEME on more than about 10,000 sequences impractical on commodity hardware. MEME-ChIP works around this by sampling sequences from the input set and running MEME only on the sample. DREME's running time grows roughly linearly with the number of characters in the sequence data, but it is limited to motifs of width 8 or less.
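To make those growth rates concrete, here is a back-of-envelope model (illustrative only; it ignores MEME's constant factors and lower-order terms):

```python
def relative_cost(char_factor, seq_factor):
    """Relative change in MEME's running time when the total character
    count grows by char_factor and the sequence count by seq_factor,
    under the quadratic-in-characters, cubic-in-sequences model above."""
    return char_factor ** 2 * seq_factor ** 3

# Doubling only the total characters quadruples the running time:
print(relative_cost(2, 1))  # 4
# Doubling both characters and sequence count gives a 32x increase:
print(relative_cost(2, 2))  # 32
```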