Question

Blast with variable number of Ns

0

14 days ago

samuel.himes • 0

I've seen people blast sequences with N filling in for sequence that can match anything.

For example

TTC...GTCNNNNNNNNNTCA...CTG

Is there any way to make the number of Ns variable?

TTC...GTCN*TCA...CTG

Or even cooler if you can specify a range

TTC...GTCN{5,15}TCA...CTG

Does anyone have a method for accomplishing something like this?

blast • 656 views

ADD COMMENT • link updated 8 days ago by lieven.sterck 15k • written 14 days ago by samuel.himes • 0

1

I'm not sure I've understood exactly what you want to do, but the EMBOSS suite of tools includes fuzznuc (see http://emboss.open-bio.org/rel/rel6/apps/fuzznuc.html), which can search a sequence using this kind of pattern (similar to a regex).

ADD REPLY • link 13 days ago by massa.kassa.sc3na ▴ 650

0

Thanks for the response! Sorry if my question was unclear. What I’m looking for is a way to perform a BLAST search where the query allows for a variable number of ambiguous nucleotides in certain regions. Meaning, rather than specifying an exact number of Ns, I’d like to search for matches that accommodate a range of Ns.

Thanks for pointing me to fuzznuc. That looks like the right idea.

This line from the documentation is exactly the behavior I'm looking for.

Repetition of an element of the pattern can be indicated by following that element with a numerical value or a numerical range between parenthesis. Examples: N(3) corresponds to N-N-N, N(2,4) corresponds to N-N or N-N-N or N-N-N-N.

However, if I'm understanding correctly, fuzznuc searches a single sequence for a pattern. I need something that could search a database of sequences for said pattern.

ADD REPLY • link 13 days ago by samuel.himes • 0

1

If your sequences on either side are longer, and you want BLAST's fuzziness for them, you could also:

split the sequence in two, i.e. run BLAST with TTC...GTCN and separately with TCA...CTG,
and then parse the BLAST output (table/json/xml) to work out the distance between the two hits.
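
A minimal Python sketch of that post-processing step, assuming both searches were saved as default tabular output (-outfmt 6) and that each alignment covers its flank end to end. The function names and the toy hits are hypothetical, for illustration only:

```python
import csv
import io
from collections import defaultdict

# Default column order of BLAST tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def read_hits(handle):
    """Group hits by subject: sseqid -> list of (start, end, strand)."""
    hits = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        rec = dict(zip(COLUMNS, row))
        sstart, send = int(rec["sstart"]), int(rec["send"])
        # For minus-strand hits BLAST reports sstart > send.
        strand = "+" if sstart <= send else "-"
        hits[rec["sseqid"]].append((min(sstart, send), max(sstart, send), strand))
    return hits


def paired_hits(left_hits, right_hits, min_gap=5, max_gap=15):
    """Yield (sseqid, strand, gap) for subjects hit by both flanks, on the
    same strand, with min_gap..max_gap bases between the two hits."""
    for sseqid in sorted(left_hits.keys() & right_hits.keys()):
        for l_start, l_end, l_strand in left_hits[sseqid]:
            for r_start, r_end, r_strand in right_hits[sseqid]:
                if l_strand != r_strand:
                    continue
                if l_strand == "+":
                    gap = r_start - l_end - 1
                else:
                    # On the minus strand the right flank lies before the
                    # left flank in subject coordinates.
                    gap = l_start - r_end - 1
                if min_gap <= gap <= max_gap:
                    yield sseqid, l_strand, gap


# Toy data standing in for the two BLAST result files.
left_tsv = ("left\tref1\t100.0\t20\t0\t0\t1\t20\t101\t120\t1e-5\t40.1\n"
            "left\tref2\t100.0\t20\t0\t0\t1\t20\t50\t69\t1e-5\t40.1\n")
right_tsv = ("right\tref1\t100.0\t20\t0\t0\t1\t20\t130\t149\t1e-5\t40.1\n"
             "right\tref2\t100.0\t20\t0\t0\t1\t20\t200\t219\t1e-5\t40.1\n")

pairs = list(paired_hits(read_hits(io.StringIO(left_tsv)),
                         read_hits(io.StringIO(right_tsv))))
print(pairs)  # [('ref1', '+', 9)]
```

Because BLAST alignments are local, a hit may stop short of the end of a flank. In that case the gap would need correcting with the qstart/qend columns, which this sketch ignores.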

ADD REPLY • link 13 days ago by SequenceServer ▴ 150

0

Yeah, my two known sequences are long enough to blast on their own. Up to this point, we actually have been blasting them separately. What I'm hoping to gain by blasting both at the same time is a list of references that contain sequence 1 and sequence 2 when they are appropriately separated by Ns.

But I believe your workaround is a good one. I could just find all the references in common between the two blast searches and keep the ones where the hits are separated by the expected range of Ns.

ADD REPLY • link 13 days ago by samuel.himes • 0

1

fuzznuc can process multiple sequences: just pass a FASTA file containing multiple sequences. From the docs:

% fuzznuc 
Search for patterns in nucleotide sequences
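
To illustrate the idea of pattern search over a multi-sequence FASTA file, here is a minimal plain-Python sketch. It is not fuzznuc itself: it only translates IUPAC letters and the N(3) / N(2,4) repetition syntax quoted above into a regular expression and scans the forward strand. All helper names and the toy sequences are hypothetical:

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

# One pattern element: a letter, optionally followed by (n) or (m,n).
TOKEN = re.compile(r"([A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")


def pattern_to_regex(pattern):
    """Translate e.g. 'GTCN(5,15)TCA' into a compiled regex."""
    parts, pos = [], 0
    while pos < len(pattern):
        m = TOKEN.match(pattern, pos)
        if m is None or m.group(1).upper() not in IUPAC:
            raise ValueError(f"unsupported pattern element at position {pos}")
        part = IUPAC[m.group(1).upper()]
        low, high = m.group(2), m.group(3)
        if low is not None and high is not None:
            part += "{%s,%s}" % (low, high)
        elif low is not None:
            part += "{%s}" % low
        parts.append(part)
        pos = m.end()
    return re.compile("".join(parts))


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def search(records, pattern):
    """Yield (header, start, end) with 1-based inclusive coordinates."""
    regex = pattern_to_regex(pattern)
    for name, seq in records:
        for m in regex.finditer(seq):
            yield name, m.start() + 1, m.end()


# Toy multi-FASTA input.
fasta = """>ref1
AAGTCACGTACGTTCAAA
>ref2
AAGTCACTCAAA
"""
matches = list(search(read_fasta(fasta.splitlines()), "GTCN(5,15)TCA"))
print(matches)  # [('ref1', 3, 16)]
```

Note that re.finditer reports non-overlapping, greedy matches only, and a real search would also need to scan the reverse complement.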

ADD REPLY • link 10 days ago by massa.kassa.sc3na ▴ 650

1

http://www.vmatch.de/ allows parameters such as maxgap and hamming distance

ADD REPLY • link 10 days ago by Jeremy Leipzig 22k

score 3 · Accepted Answer · 2024-12-16

Short answer: no, BLAST is not able to do that.

BLAST only takes FASTA sequences as input (and, to some extent, raw/bare sequence or GenBank/EMBL format). Sequences are expected to use the standard IUB/IUPAC amino acid and nucleic acid codes, with a few exceptions. For instance, * can be used to indicate a stop codon, but only when the input is protein (not nucleotide). Another is the hyphen -, which denotes a gap of indeterminate length (on the command line only, not on the BLAST website).

As SequenceServer already indicated, you can likely get the information you want, just not directly, so some post-processing will be required. If the non-N stretches match a hit, BLAST will create a gap in the alignment where the Ns are. If you dig into the detailed parameters of your BLAST search, you can tweak them to get the output you want more directly, but this will require some trial and error (for instance, playing with the gap-open and gap-extension costs) and will likely not work for every kind of input you might have.

One more remark: BLAST is a search tool, not an alignment tool! To get the best alignment, you're better off using a real alignment tool.