Question

Blast: do separators matter?

5.0 years ago

marongiu.luigi ▴ 730

Hello,

If I have two sequences like this:

> seq 1
CCGCAGTAACGATATCCTTCGCCAGGCTTCCGCTTATTTTGCGAAGGCGGAGTTCGACCG
> seq 2
TCAGCAACAGCGACATCATCCGGATAAACGCAGTGCCCGTGCGCAGCGCGATGACTGGCTGAAGAAAGAGATACAGCGCGTATACGATGAAAATCACAAGG

that should map to the same organism, and I would like to confirm this using BLAST, is it better to run each sequence alone or to concatenate them?

I think concatenation is better because the longer query has more nucleotides and should reduce the false-hit rate, but I would like to confirm.

Also, I should not concatenate them directly, but separate them to tell the software that this is not a continuous sequence. Should I use N, -, or something else, and how many separator characters should I add?

For instance, if I use

CCGCAGTAACGATATCCTTCGCCAGGCTTCCGCTTATTTTGCGAAGGCGGAGTTCGACCGNNNNNNNNNNTCAGCAACAGCGACATCATCCGGATAAACGCAGTGCCCGTGCGCAGCGCGATGACTGGCTGAAGAAAGAGATACAGCGCGTATACGATGAAAATCACAAGG

the first hit on BlastX is integrase [Escherichia coli O45:H2 str. 2009C-3686], but if I use

CCGCAGTAACGATATCCTTCGCCAGGCTTCCGCTTATTTTGCGAAGGCGGAGTTCGACCG----------CAGCAACAGCGACATCATCCGGATAAACGCAGTGCCCGTGCGCAGCGCGATGACTGGCTGAAGAAAGAGATACAGCGCGTATACGATGAAAATCACAAGG

I get IS3 family transposase [Escherichia coli]. A similar kind of protein, but not the same one.


ADD COMMENT • link updated 5.0 years ago by Mensur Dlakic ★ 28k • written 5.0 years ago by marongiu.luigi ▴ 730

is it better to run each sequence alone or to concatenate them?

It sounds like you've made up your mind to use the sequences together somehow, and you want us to tell you you're right.

ADD REPLY • link 5.0 years ago by Ram 44k

No, I would like to know whether this approach is viable or biased. The idea behind concatenating was to get more specific results (a slightly better E-value) and also to ease the computation: instead of running BLAST (which takes a long time) n times for n sequences, I would launch it just once with a longer sequence. That is, if this streamlined approach really is better than running n shorter sequences.

ADD REPLY • link 5.0 years ago by marongiu.luigi ▴ 730

score 3 · Accepted Answer · 2019-12-17

5.0 years ago

Mensur Dlakic ★ 28k

BLAST searches the database by considering local alignments. If the two sequences you have are separated by a long stretch of DNA, it won't make any difference how you enter them, because they will be scored independently as HSPs (high-scoring segment pairs), though you will get a slightly better overall E-value from concatenation than you would from them individually.

What you entered above are two different sequences: Ns actually count as (any) bases, while gaps (-) do not. So your first composite sequence is 10 bases longer than the second, as it has a 10-base linker between the two sequences, which explains why you got different scores and different matches.

Assuming your sequences are similar enough to the reference genome, you should get the same result whether you concatenate them or search individually.
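To make the length difference concrete, here is a minimal Python sketch. It uses toy fragments rather than the sequences from the question, and it assumes that `-` characters are simply dropped from the query while `N` is kept as a residue, as described above. The helper names `join_with_linker` and `residue_count` are made up for illustration.

```python
def join_with_linker(fragments, linker_char="N", linker_len=10):
    """Concatenate fragments, placing a fixed-length linker between them."""
    return (linker_char * linker_len).join(fragments)


def residue_count(sequence):
    """Count characters kept as residues; '-' is assumed to be dropped."""
    return sum(1 for ch in sequence if ch != "-")


# Toy 10-base fragments (not the sequences from the question).
frag_a = "ACGTACGTAC"
frag_b = "TTGGCCAATT"

with_n = join_with_linker([frag_a, frag_b], "N", 10)
with_dash = join_with_linker([frag_a, frag_b], "-", 10)

print(residue_count(with_n))     # 30: both fragments plus the 10 Ns
print(residue_count(with_dash))  # 20: only the two fragments
```

Under that assumption, the dash-linked query is the same as a direct concatenation, while the N-linked query carries 10 extra ambiguous bases.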

ADD COMMENT • link 5.0 years ago by Mensur Dlakic ★ 28k

Thank you, so I gather I should use '-' and not 'N'.

ADD REPLY • link 5.0 years ago by marongiu.luigi ▴ 730

Most likely you don't need to use either; I'm pretty sure that - characters are ignored by the BLAST engine. Unless the gap between your two sequences is very short, BLAST will split them into two HSPs, as the penalty for a long gap would ruin the overall score.

ADD REPLY • link 5.0 years ago by Mensur Dlakic ★ 28k

Ok, it makes sense. And for running a concatenated sequence versus the individual sequences, would the former provide any kind of computational improvement? Thank you

ADD REPLY • link 5.0 years ago by marongiu.luigi ▴ 730
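On the point about running BLAST n times: the BLAST+ command-line programs accept a multi-FASTA query file, so n sequences can be searched in one invocation without concatenating them. A rough Python sketch of that setup follows. The file names, the `nr` database, and the helper names `to_multi_fasta` and `blastx_command` are placeholders chosen for illustration, not anything from the thread, and the sequences are toy examples.

```python
def to_multi_fasta(records):
    """Format (name, sequence) pairs as one multi-FASTA string."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


def blastx_command(query_path, database, out_path):
    """Build the argument list for one blastx run over every query in the file."""
    return [
        "blastx",
        "-query", query_path,  # multi-FASTA file holding all query sequences
        "-db", database,       # placeholder database name
        "-outfmt", "6",        # tabular output; first column is the query ID
        "-out", out_path,
    ]


# Toy records (not the sequences from the question).
records = [("seq1", "ACGTACGTAC"), ("seq2", "TTGGCCAATT")]

print(to_multi_fasta(records), end="")
print(" ".join(blastx_command("queries.fasta", "nr", "hits.tsv")))
```

Each query is still scored on its own, so hits stay attributable to individual sequences through the query ID column. Whether this is faster than a single concatenated query was not settled in the thread.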