Question

Please share the best tool for de novo motif discovery in a large dataset

10.4 years ago

seta ★ 1.9k

Hi everybody,

Could anybody please let me know what the best tool is for de novo motif discovery in a large dataset, say a 50 Mb sequence file with some sequences up to 2000 bp in length? Looking forward to hearing your helpful suggestions.

genome rna-seq sequence • 4.3k views

ADD COMMENT • link updated 3.0 years ago by Ram 45k • written 10.4 years ago by seta ★ 1.9k

See this: 7N Motif Search Over The Genome

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by PoGibas 5.1k

Thanks, but I mean de novo motif discovery; I'm not looking for a particular motif. I found that MEME works best for short sequences, and I'm looking for something similar for large sequencing data that contains rather long sequences.

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by seta ★ 1.9k

Have you tried to run MEME/DREME on command line using the available options?

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by Tom ▴ 240

Yeah, but I got the error "dataset is too large", and it doesn't work even after changing "-maxsize". Besides, as far as I have read, MEME just randomly selects 600 bp from the input sequence and finds the motif in the central 100 bp. So how can it work well for sequences of 2000 bp or more in length?

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by seta ★ 1.9k
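
On the size limit raised above: one common workaround is to shrink the input before handing it to a motif finder, by keeping a random subset of sequences and, optionally, only a central window of each. The sketch below is a minimal illustration of that idea, not a description of what MEME does internally. The function names, `max_seqs` and `width` are hypothetical, and center trimming is only reasonable if the motifs are expected near the middle of each sequence (as in ChIP-seq peaks), which may not hold for 2000 bp sequences.

```python
import random

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def center_trim(seq, width):
    """Return the central `width` bases of seq (the whole seq if shorter)."""
    if len(seq) <= width:
        return seq
    start = (len(seq) - width) // 2
    return seq[start:start + width]

def shrink_dataset(records, max_seqs, width, seed=0):
    """Randomly keep at most max_seqs records, each trimmed to its centre."""
    records = list(records)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    if len(records) > max_seqs:
        records = rng.sample(records, max_seqs)
    return [(h, center_trim(s, width)) for h, s in records]
```

Typical use would be `shrink_dataset(read_fasta(open("input.fa")), max_seqs=..., width=...)`, writing the result back out as FASTA; `input.fa` is a placeholder name. Running the motif finder on several different random subsets and comparing the motifs is a cheap way to check that the result does not depend on the sample.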

Ram · Answer 1 · 2015-03-31

The detection rate of any individual motif prediction tool on its own is poor, whether it is for small or large data sets. The best approach is to use a combination of different tools to get more reliable results.

Some of the best-ranked ones are MEME, MotifSampler and Weeder (ref: Tompa et al., Assessing computational tools for the discovery of transcription factor binding sites, Nature Biotechnology, 23(1), 137-144).

Adjust the parameters for each of these tools to maximize true positives (based on any training data set). De novo prediction results can be very sensitive to these parameter settings.
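
As a rough illustration of what "combining tools" can mean in practice, the sketch below keeps only predicted sites that are supported by at least two tools. It assumes each tool's output has already been parsed into `(seq_id, start, end)` tuples with half-open coordinates; the tool names, the data layout and the `min_tools` threshold are hypothetical, and real pipelines often compare the motif matrices themselves rather than site positions.

```python
from collections import defaultdict

def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) overlap."""
    return a[0] < b[1] and b[0] < a[1]

def supported_sites(predictions, min_tools=2):
    """Return sites backed by at least `min_tools` different tools.

    predictions: dict mapping tool name -> list of (seq_id, start, end).
    A site is backed by a tool if that tool reports an overlapping
    interval on the same sequence (the reporting tool counts as one).
    """
    by_seq = defaultdict(list)
    for tool, sites in predictions.items():
        for seq_id, start, end in sites:
            by_seq[seq_id].append((tool, start, end))

    kept = []
    for seq_id, hits in by_seq.items():
        for tool, start, end in hits:
            backers = {t for t, s, e in hits if overlaps((start, end), (s, e))}
            if len(backers) >= min_tools:
                kept.append((seq_id, start, end, tool, len(backers)))
    return sorted(kept)
```

The same function can be reused when tuning parameters: run each tool with a few settings on sequences with known sites and compare how many known sites survive the consensus filter.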

Ram · Answer 2 · 2015-03-31

RSAT (Regulatory Sequence Analysis Tools) comprises a wide collection of modular tools for the detection of cis-regulatory elements in genome sequences. Thirteen new programs have been added to the 30 described in the 2008 NAR Web Software Issue, including an automated sequence retrieval from EnsEMBL (retrieve-ensembl-seq), two novel motif discovery algorithms (oligo-diff and info-gibbs), a 100-times faster version of matrix-scan enabling the scanning of genome-scale sequence sets, and a series of facilities for random model generation and statistical evaluation (random-genome-fragments, random-motifs, random-sites, implant-sites, sequence-probability, permute-matrix). Our most recent work also focused on motif comparison (compare-matrices) and evaluation of motif quality (matrix-quality) by combining theoretical and empirical measures to assess the predictive capability of position-specific scoring matrices. To process large collections of peak sequences obtained from ChIP-seq or related technologies, RSAT provides a new program (peak-motifs) that combines several efficient motif discovery algorithms to predict transcription factor binding motifs, match them against motif databases and predict their binding sites. Availability (web site, stand-alone programs and SOAP/WSDL (Simple Object Access Protocol/Web Services Description Language) web services): http://rsat.ulb.ac.be/rsat/.

http://pedagogix-tagc.univ-mrs.fr/rsat/
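
To give a feel for why word-counting approaches scale to large inputs, here is a deliberately simplified sketch of k-mer over-representation: count every k-mer in the input sequences and in a background set, then rank k-mers by the ratio of their frequencies. This is not RSAT's or peak-motifs' actual algorithm or statistics. The function names, the pseudocount and the choice of background (for example shuffled or control sequences) are assumptions made for illustration.

```python
from collections import Counter

def count_kmers(seqs, k):
    """Count every overlapping k-mer that consists only of A/C/G/T."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip k-mers containing N etc.
                counts[kmer] += 1
    return counts

def enriched_kmers(foreground, background, k=6, pseudocount=1.0, top=10):
    """Rank foreground k-mers by foreground/background frequency ratio.

    Returns a list of (ratio, kmer, foreground_count), highest ratio first.
    """
    fg = count_kmers(foreground, k)
    bg = count_kmers(background, k)
    fg_total = sum(fg.values()) or 1
    bg_total = sum(bg.values()) or 1
    scored = []
    for kmer, n in fg.items():
        fg_freq = (n + pseudocount) / fg_total
        bg_freq = (bg[kmer] + pseudocount) / bg_total
        scored.append((fg_freq / bg_freq, kmer, n))
    scored.sort(reverse=True)
    return scored[:top]
```

Counting is linear in the total sequence length, so a 50 Mb file is not a problem in principle. A real tool would additionally count both strands, use a proper significance test, and assemble overlapping enriched words into a matrix.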

Ram · Answer 3 · 2015-01-26

10.4 years ago

Jorge Amigo 14k

perl -ne '/motif/ and print' file

awk '/motif/' file

...

ADD COMMENT • link updated 3.0 years ago by Ram 45k • written 10.4 years ago by Jorge Amigo 14k
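
For completeness, the one-liners above search each line for a literal pattern, so they only apply when the motif is already known, and they miss matches that span a line break in a wrapped FASTA file. A minimal Python sketch of the same known-motif search on a whole sequence, extended to IUPAC degenerate codes, could look like this; the function names are hypothetical and only the forward strand is scanned.

```python
import re

# IUPAC nucleotide codes expressed as regex character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def motif_to_regex(motif):
    """Compile an IUPAC motif into a regex that reports overlapping hits."""
    pattern = "".join(IUPAC[base] for base in motif.upper())
    # A lookahead consumes no characters, so overlapping matches are found.
    return re.compile("(?=(" + pattern + "))")

def find_motif(seq, motif):
    """Return (0-based start, matched text) for every hit on the forward strand."""
    regex = motif_to_regex(motif)
    return [(m.start(), m.group(1)) for m in regex.finditer(seq.upper())]
```

None of this is de novo discovery: it only locates a motif that has been specified in advance.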

Thanks, Jorge. Would you please let me know the source of the program and give some details?

ADD REPLY • link 10.4 years ago by seta ★ 1.9k

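awk is a standard Unix utility that ships with practically every Linux/macOS system (it is a separate program, not a bash builtin). perl is usually installed by default on most systems - you might have to install it if your system has never used perl before.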

ADD REPLY • link 10.4 years ago by Ram 45k

Thanks for your explanation. Yeah, perl was installed. However, as I mentioned, I'm looking for a de novo motif discovery tool that can handle a large dataset with some long sequences, e.g. 2000 bp.

ADD REPLY • link 10.4 years ago by seta ★ 1.9k

Yeah, I realized that when I read through your post again. I think HMM- or SVM-based tools might help, but I haven't used any, so I'm not of much use here, unfortunately.

ADD REPLY • link 10.4 years ago by Ram 45k

Jorge, I think OP is looking for de novo motifs, so something HMM-based might be more appropriate, no?

ADD REPLY • link 10.4 years ago by Ram 45k

There are plenty of ways to do motif finding. My answer was just to point out that if the question is not well described, very simple answers such as perl/awk/grep/... are what you may get. If the input and the motif are described, the answers can be more useful.

ADD REPLY • link 10.4 years ago by Jorge Amigo 14k

I guess OP edited the question once they realized that it wasn't clear enough. But yeah, one has to be more specific when seeking help.

ADD REPLY • link 10.4 years ago by Ram 45k