Hi everybody,
Could anybody please tell me what the best tool is for de novo motif discovery in a large dataset, say a 50 Mb sequencing file with some sequences up to 2000 bp in length? I look forward to your suggestions.
The detection rate of any individual motif prediction tool on its own is poor, whether for small or large data sets. The best approach is to use a combination of different tools to get more reliable results.
Some of the best-ranked ones are MEME, MotifSampler and Weeder (ref: Tompa et al., Assessing computational tools for the discovery of transcription factor binding sites, Nature Biotechnology, 23(1):137-144).
Adjust the parameters for each of these tools to maximize true positives (based on any training data set). De novo prediction results can be very sensitive to these parameter settings.
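As a rough illustration of combining tools, here is a minimal Python sketch that keeps only the motifs reported by at least two tools. It assumes each tool's output has already been reduced to plain consensus strings over A/C/G/T/N; the tool names, the example motifs and the function names are hypothetical, and exact string matching (with reverse complements merged) is a crude stand-in for a proper matrix-to-matrix comparison.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T/N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def canonical(seq):
    """Map a motif and its reverse complement to one shared key."""
    seq = seq.upper()
    return min(seq, reverse_complement(seq))


def consensus_support(predictions, min_support=2):
    """Keep motifs reported by at least `min_support` different tools.

    `predictions` maps a tool name to the list of consensus strings it
    reported. Returns {canonical_motif: sorted list of supporting tools}.
    """
    support = {}
    for tool, motifs in predictions.items():
        # A set, so a tool reporting both strands is counted only once.
        for key in {canonical(m) for m in motifs}:
            support.setdefault(key, set()).add(tool)
    return {m: sorted(tools) for m, tools in support.items()
            if len(tools) >= min_support}


# Hypothetical predictions, for illustration only.
predictions = {
    "tool_a": ["TGACTCA", "ACGTGA"],
    "tool_b": ["TGAGTCA"],
    "tool_c": ["ACGTGA", "TCACGT", "GGGCCC"],
}
shared = consensus_support(predictions)
```

In practice the motifs from different tools rarely match letter for letter, so a similarity threshold on the position weight matrices would be more realistic than exact matching.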
RSAT (Regulatory Sequence Analysis Tools) comprises a wide collection of modular tools for the detection of cis-regulatory elements in genome sequences. Thirteen new programs have been added to the 30 described in the 2008 NAR Web Software Issue, including an automated sequence retrieval from EnsEMBL (retrieve-ensembl-seq), two novel motif discovery algorithms (oligo-diff and info-gibbs), a 100-times faster version of matrix-scan enabling the scanning of genome-scale sequence sets, and a series of facilities for random model generation and statistical evaluation (random-genome-fragments, random-motifs, random-sites, implant-sites, sequence-probability, permute-matrix). Our most recent work also focused on motif comparison (compare-matrices) and evaluation of motif quality (matrix-quality) by combining theoretical and empirical measures to assess the predictive capability of position-specific scoring matrices. To process large collections of peak sequences obtained from ChIP-seq or related technologies, RSAT provides a new program (peak-motifs) that combines several efficient motif discovery algorithms to predict transcription factor binding motifs, match them against motif databases and predict their binding sites. Availability (web site, stand-alone programs and SOAP/WSDL (Simple Object Access Protocol/Web Services Description Language) web services): http://rsat.ulb.ac.be/rsat/.
perl -ne '/motif/ and print' file
awk '/motif/' file
...
See this: 7N Motif Search Over The Genome
Thanks, but I mean de novo motif discovery; I'm not looking for a particular motif. I found that MEME works best for short sequences, and I'm looking for something similar for large sequencing data sets that contain mostly long sequences.
Have you tried running MEME/DREME on the command line using the available options?
Yes, but I got the error "dataset is too large", and it doesn't work even after changing "-maxsize". Besides, as far as I have read, MEME just randomly selects 600 bp from the input sequence and searches for the motif in the central 100 bp. So how can it work well for sequences of 2000 bp or more in length?
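One possible workaround for the size limit is to shrink the input yourself before running the tool: take a random subset of the sequences and trim each one to a central window. The Python sketch below shows the idea under stated assumptions. The function names, the subset size and the window width are all illustrative and not taken from MEME, and trimming to the center only makes sense if the motif is expected near the middle of each sequence (as with ChIP-seq peaks); otherwise it may discard real sites.

```python
import random


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def center_window(seq, width):
    """Return the central `width` bases; shorter sequences are kept whole."""
    if len(seq) <= width:
        return seq
    start = (len(seq) - width) // 2
    return seq[start:start + width]


def subsample_and_trim(records, n_seqs, width, seed=0):
    """Randomly keep at most `n_seqs` records, each trimmed to its center."""
    records = list(records)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    if len(records) > n_seqs:
        records = rng.sample(records, n_seqs)
    return [(header, center_window(seq, width)) for header, seq in records]


def write_fasta(records, handle):
    """Write (header, sequence) pairs to a file-like object."""
    for header, seq in records:
        handle.write(f">{header}\n{seq}\n")
```

Usage would look like opening the original file, passing it to `read_fasta`, then writing the result of `subsample_and_trim(records, n_seqs=600, width=100)` to a new FASTA file with `write_fasta`; the values 600 and 100 here are only placeholders to tune for your data.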