Hi All,
I want to search a short nucleotide motif (30-60 bp) against millions of long sequences (400-500 bp). Which algorithm and standalone program/tool would be good for this?
Thanks for any help!
Hi, by now you might have found a solution to this question anyway.
A while ago I wrote a program, SequenceMatcher, which might suit you. In your case you could do something like:
java -jar ~/path/to/SequenceMatcher.jar match -a motif.fa -b sequences.fa -aln local
The output is in an easy-to-parse tabular format or in SAM format. For 1 motif vs. 1M sequences it might be slow, but not terribly so.
For aligning short motifs to long sequences I got good results with MEME. You can download a stand-alone version and run it on a local computer. The suite also has some nice tools for working with motifs, e.g. searching for similar motifs in the JASPAR and UniPROBE databases, finding enrichment of multiple motifs, and so on. It depends on what you want to do.
If you are only interested in an alignment you can use exonerate, a nice command-line tool that doesn't require much installation and has many options for different types of alignment. It is especially good if you need to align cDNA sequences to genomic DNA, because it has a model for handling large introns and identifying exon junctions. But it is good for other types of alignments as well.
I think you should refine your question. In particular: How many mismatches do you want to allow between motif and target? How many motifs do you have: just a few (say < 1000), or millions? Is it enough to know that a motif is present in a target sequence, or do you also want the best alignment?
In the simplest case (no mismatches, few motifs, and only presence/absence needed) something as simple as a grep command could do the job.
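One caveat with plain grep is that FASTA sequences are often wrapped over several lines, so a motif spanning a line break would be missed. A minimal Python sketch of the same exact-match idea, joining wrapped lines first and also checking the reverse complement, might look like the following. The function names (`read_fasta`, `find_exact`) are made up for illustration and are not taken from any of the tools mentioned in this thread.

```python
# Translation table used to complement a DNA string.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA handle,
    joining sequences that are wrapped over several lines."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first word of the header as the record name.
            name, chunks = (line[1:].split() or [""])[0], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_exact(motif, handle):
    """Yield (name, strand, 0-based start) for the first exact
    occurrence of the motif on each strand of every record."""
    forward = motif.upper()
    reverse = reverse_complement(forward)
    for name, seq in read_fasta(handle):
        seq = seq.upper()
        for strand, pattern in (("+", forward), ("-", reverse)):
            pos = seq.find(pattern)
            if pos != -1:
                yield name, strand, pos
```

It could be used as `with open("sequences.fa") as fh: hits = list(find_exact(motif, fh))`, where `sequences.fa` is the file name from the command earlier in the thread. This only covers the no-mismatch case; anything with mismatches or gaps needs an alignment approach.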
Hi, basically I have one motif (query) that I want to align against each of the long sequences (a custom db of a million reads), and I need standard tabular output like BLAST (% identity, alignment length, etc.). I found that BLAST is not a good option. I probably need a Smith-Waterman based tool, not sure though.
When you write "motif", do you mean a plain sequence or a motif with ambiguities?
Thanks, I meant a sequence.
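For one plain query sequence against many reads, with BLAST-like columns, the Smith-Waterman idea mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scoring values (match 2, mismatch -3, linear gap -5) are arbitrary defaults chosen here, the function name `smith_waterman` is hypothetical, and only the forward strand is aligned. Pure Python would be far too slow for millions of reads, so in practice an optimized implementation would be needed.

```python
def smith_waterman(query, subject, match=2, mismatch=-3, gap=-5):
    """Best local alignment of query vs. subject (linear gap penalty).

    Returns a tuple (pct_identity, aln_length, mismatches, gaps,
    q_start, q_end, s_start, s_end, score) with 1-based inclusive
    coordinates, or None if no positive-scoring alignment exists.
    """
    n, m = len(query), len(subject)
    # H[i][j] = best score of a local alignment ending at query[i-1], subject[j-1].
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, n + 1):
        q_char = query[i - 1]
        row, prev = H[i], H[i - 1]
        for j in range(1, m + 1):
            diag = prev[j - 1] + (match if q_char == subject[j - 1] else mismatch)
            score = max(0, diag, prev[j] + gap, row[j - 1] + gap)
            row[j] = score
            if score > best:
                best, best_i, best_j = score, i, j
    if best == 0:
        return None

    # Trace back from the best cell until the score drops to zero.
    i, j = best_i, best_j
    matches = mismatches = gaps = 0
    while i > 0 and j > 0 and H[i][j] > 0:
        same = query[i - 1] == subject[j - 1]
        if H[i][j] == H[i - 1][j - 1] + (match if same else mismatch):
            if same:
                matches += 1
            else:
                mismatches += 1
            i -= 1
            j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            gaps += 1  # query base aligned to a gap in the subject
            i -= 1
        else:
            gaps += 1  # subject base aligned to a gap in the query
            j -= 1
    length = matches + mismatches + gaps
    identity = 100.0 * matches / length
    return (identity, length, mismatches, gaps,
            i + 1, best_i, j + 1, best_j, best)
```

Calling this once per read and joining the returned fields with tabs gives one line per read, similar in spirit to BLAST tabular output. To cover both strands, the reverse complement of the query would need to be aligned as well.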