I'm not exactly sure I've understood what you want to do, but the EMBOSS suite of tools includes fuzznuc (see: http://emboss.open-bio.org/rel/rel6/apps/fuzznuc.html), which can use this kind of pattern (similar to a regex) to search a sequence.
Thanks for the response! Sorry if my question was unclear. What I’m looking for is a way to perform a BLAST search where the query allows for a variable number of ambiguous nucleotides in certain regions. Meaning, rather than specifying an exact number of Ns, I’d like to search for matches that accommodate a range of Ns.
Thanks for pointing me to fuzznuc. That looks like the right idea.
This line from the documentation is exactly the behavior I'm looking for.
Repetition of an element of the pattern can be indicated by following that element with a numerical value or a numerical range between parenthesis. Examples: N(3) corresponds to N-N-N, N(2,4) corresponds to N-N or N-N-N or N-N-N-N.
However, if I'm understanding correctly, fuzznuc searches a single sequence for a pattern. I need something that could search a database of sequences for said pattern.
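To make the N(2,4) idea concrete, here is a minimal Python sketch that turns that style of pattern into a regular expression and scans every record of a multi-FASTA. It is not fuzznuc itself: the function names (`pattern_to_regex`, `read_fasta`, `search_records`) are made up for illustration, only A/C/G/T, N and the `(n)` / `(n,m)` repeat syntax are handled, and only the forward strand is searched.

```python
import re


def pattern_to_regex(pattern):
    """Translate a small subset of the fuzznuc-style syntax into a regex.

    Supported (an assumption of this sketch): literal bases, N for any base,
    and repeat counts such as N(3) or N(2,4). Hyphens are ignored.
    """
    body = pattern.upper().replace("-", "")

    def repeat(match):
        low, high = match.group(1), match.group(2)
        return "{" + low + ("," + high if high else "") + "}"

    # X(3) -> X{3}, X(2,4) -> X{2,4}
    body = re.sub(r"\((\d+)(?:,(\d+))?\)", repeat, body)
    # N stands for any base (including an N in the reference itself)
    body = body.replace("N", "[ACGTN]")
    return re.compile(body)


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks).upper()
            name, chunks = (line[1:].split() or [""])[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks).upper()


def search_records(lines, pattern):
    """Return (name, start, end, match) for each non-overlapping hit, 1-based."""
    regex = pattern_to_regex(pattern)
    hits = []
    for name, seq in read_fasta(lines):
        for m in regex.finditer(seq):
            hits.append((name, m.start() + 1, m.end(), m.group(0)))
    return hits


if __name__ == "__main__":
    fasta = ">ref1\nTTACGTAAGGCCTT\n>ref2\nACGTAAAAAAGGCC\n"
    for hit in search_records(fasta.splitlines(), "ACGTN(2,4)GGCC"):
        print(hit)
```

With a real database you would pass an open file handle instead of the toy string. Unlike BLAST, a regex like this tolerates no mismatches in the flanking sequences.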
BLAST only takes FASTA sequences as input (and, to some extent, raw/bare sequence or gb/embl format). Sequences are expected to use the standard IUB/IUPAC amino acid and nucleic acid codes, with a few exceptions. For instance, * can be used to indicate a stop codon, but only when the input is protein (not nucleotide). Another is the hyphen -, which denotes a gap of indeterminate length (only on the command line, though, not on the BLAST website).
As SequenceServer already indicated, you can likely get the info you want, but not directly, so some post-processing will be required. If the non-N stretch matches a hit, BLAST will create a gap in the alignment where the Ns are.
If you dig into the detailed parameters of your BLAST search, you can tweak them to get the output you want more directly, but this will require some trial and error (for instance, playing with the gap-open and gap-extension costs) and will likely not work for every kind of input you might have.
One more remark: blast is a search tool, not an alignment tool!! To get the best alignment you're better off using a real alignment tool.
If your sequences on either side are longer, and if you want BLAST's fuzziness for the sequences on either side, you could also:
Yeah, my two known sequences are long enough to BLAST on their own. Up to this point, we actually have been BLASTing them separately. What I'm hoping to gain by BLASTing both at the same time is a list of references that contain sequence 1 and sequence 2 when they are appropriately separated by Ns.
But I believe your workaround is a good one. I could just find all the references in common between the two BLAST searches and keep the ones where the hits are separated by the expected range of Ns.
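For what it's worth, that post-processing step could look roughly like the Python sketch below. It assumes both searches were saved in BLAST's standard 12-column tabular format (-outfmt 6, where columns 2, 9 and 10 are sseqid, sstart and send), that each hit covers its query end to end (a partial alignment would shift the computed gap), and that sequence 1 lies upstream of sequence 2. The function names and the toy rows are hypothetical.

```python
from collections import defaultdict


def parse_outfmt6(lines):
    """Group hits by subject ID as (low, high, on_plus_strand) tuples."""
    hits = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        sseqid, sstart, send = fields[1], int(fields[8]), int(fields[9])
        # For nucleotide hits on the minus strand, sstart is larger than send
        hits[sseqid].append((min(sstart, send), max(sstart, send), sstart <= send))
    return hits


def paired_hits(hits1, hits2, min_gap, max_gap):
    """Return (sseqid, gap) where a seq1 hit and a seq2 hit are min_gap..max_gap apart."""
    pairs = []
    for sseqid in sorted(hits1.keys() & hits2.keys()):
        for low1, high1, plus1 in hits1[sseqid]:
            for low2, high2, plus2 in hits2[sseqid]:
                if plus1 != plus2:
                    continue  # both halves must sit on the same strand
                # Plus strand: seq1 is left of seq2; minus strand: the order flips
                gap = (low2 - high1 - 1) if plus1 else (low1 - high2 - 1)
                if min_gap <= gap <= max_gap:
                    pairs.append((sseqid, gap))
    return pairs


if __name__ == "__main__":
    # Toy rows standing in for two real BLAST result files
    blast1 = ["seq1\trefA\t100.0\t20\t0\t0\t1\t20\t101\t120\t1e-5\t40.1"]
    blast2 = ["seq2\trefA\t100.0\t20\t0\t0\t1\t20\t124\t143\t1e-5\t40.1"]
    print(paired_hits(parse_outfmt6(blast1), parse_outfmt6(blast2), 2, 4))
```

The nested loop is quadratic per reference, which should be fine for typical hit counts. Sorting hits by position would be the obvious refinement if a reference has thousands of hits.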
According to its docs, fuzznuc can process multiple sequences. Just pass a FASTA file containing multiple sequences.
Vmatch (http://www.vmatch.de/) allows parameters such as maxgap and Hamming distance.