Blast: do separators matter?
5.0 years ago

Hello,

If I have two sequences like this:

> seq 1
CCGCAGTAACGATATCCTTCGCCAGGCTTCCGCTTATTTTGCGAAGGCGGAGTTCGACCG
> seq 2
TCAGCAACAGCGACATCATCCGGATAAACGCAGTGCCCGTGCGCAGCGCGATGACTGGCTGAAGAAAGAGATACAGCGCGTATACGATGAAAATCACAAGG

that should map to the same organism, and I would like to confirm this using Blast, is it better to run each sequence alone or to concatenate them?

I think concatenating is better because a longer query has more nucleotides and should reduce the rate of false hits, but I would like to confirm.

Also, I assume I should not concatenate them directly, but insert a separator to tell the software that this is not a continuous sequence. Should I use N, -, or something else? And how many separator characters should I add?

For instance, if I use

CCGCAGTAACGATATCCTTCGCCAGGCTTCCGCTTATTTTGCGAAGGCGGAGTTCGACCGNNNNNNNNNNTCAGCAACAGCGACATCATCCGGATAAACGCAGTGCCCGTGCGCAGCGCGATGACTGGCTGAAGAAAGAGATACAGCGCGTATACGATGAAAATCACAAGG

with BlastX, the first hit is integrase [Escherichia coli O45:H2 str. 2009C-3686], but if I use

CCGCAGTAACGATATCCTTCGCCAGGCTTCCGCTTATTTTGCGAAGGCGGAGTTCGACCG----------CAGCAACAGCGACATCATCCGGATAAACGCAGTGCCCGTGCGCAGCGCGATGACTGGCTGAAGAAAGAGATACAGCGCGTATACGATGAAAATCACAAGG

I get IS3 family transposase [Escherichia coli]. A similar kind of protein, but not the same one.

blast use • 953 views

"is it better to run each sequence alone or to concatenate them?"

It sounds like you've made up your mind to use the sequences together somehow, and you want us to tell you you're right.


No, I would like to know whether this approach is viable or biased. The idea behind concatenating was to get more specific results (a slightly better e-value) but also to ease the computation: instead of running blast (which takes a lot of time) n times for n sequences, I would launch it just once with a longer sequence. That only makes sense, of course, if a single run with the concatenated sequence really is better than n runs with the shorter sequences.

5.0 years ago
Mensur Dlakic ★ 28k

BLAST searches the database using local alignments. If the two sequences you have are separated by a long stretch of DNA in the target, it won't make any difference how you enter them, because they will be scored independently as HSPs (high-scoring segment pairs), though you will get a slightly better overall E-value from concatenation than you would from searching them individually.

The two queries you entered above are not equivalent: Ns count as (any) bases, while gaps (-) do not. Your first composite sequence is therefore 10 bases longer than the second, because it has a 10-base linker between the two sequences, which explains why you got different scores and different matches.
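To make the difference between the two linkers concrete, here is a minimal Python sketch. It assumes, as suggested in this thread, that '-' characters are dropped from the query while N characters are kept as bases; the helper names (`effective_query`, `second_fragment_start`) are made up for illustration and are not part of BLAST. It compares the effective query length and the position of the second fragment modulo 3, which matters for a translated search such as BlastX.

```python
# Sequences taken from the question.
SEQ1 = "CCGCAGTAACGATATCCTTCGCCAGGCTTCCGCTTATTTTGCGAAGGCGGAGTTCGACCG"
SEQ2 = ("TCAGCAACAGCGACATCATCCGGATAAACGCAGTGCCCGTGCGCAGCGCGATGACTGGCT"
        "GAAGAAAGAGATACAGCGCGTATACGATGAAAATCACAAGG")


def effective_query(seq):
    """Return the query as it would be searched.

    Assumption (from the thread, not verified here): '-' characters
    are ignored, while 'N' characters are kept as real positions.
    """
    return seq.replace("-", "")


def second_fragment_start(first, linker):
    """0-based position where the second fragment starts in the effective query."""
    return len(effective_query(first + linker))


n_query = effective_query(SEQ1 + "N" * 10 + SEQ2)
gap_query = effective_query(SEQ1 + "-" * 10 + SEQ2)

n_start = second_fragment_start(SEQ1, "N" * 10)
gap_start = second_fragment_start(SEQ1, "-" * 10)

for label, query, start in (("N linker", n_query, n_start),
                            ("- linker", gap_query, gap_start)):
    # start % 3 is the offset of the second fragment relative to the
    # first codon position of the query.
    print(f"{label}: length={len(query)}, second fragment starts at {start}, "
          f"offset mod 3 = {start % 3}")
```

Because 10 is not a multiple of 3, the N linker moves the second fragment into a different reading frame relative to the first fragment than it has when the dashes are dropped, so the two queries are not equivalent for a translated search. Note also that the dashed query as posted begins its second fragment with CAGC rather than TCAGC, so the two queries in the question may differ by more than the linker alone.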

Assuming your sequences are similar enough to the reference genome, you should get the same result whether you concatenate them or search individually.


Thank you, so I gather I should use '-' and not 'N'


Most likely you don't need to use either. I'm pretty sure that - characters are ignored by the BLAST engine. Unless the gap between your two sequences in the target is very short, BLAST will split them into two HSPs, as the penalty for a long gap would ruin the overall score.


Ok, it makes sense. And for running a concatenated sequence versus the individual sequences, would the former provide any kind of computational improvement? Thank you
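One way to avoid n separate launches without concatenating anything is to put all the sequences into a single multi-record FASTA file and run BLAST once on that file; BLAST+ accepts multi-record FASTA queries and reports hits per record. The sketch below is only an illustration: the helper `to_fasta`, the file names `queries.fasta` and `hits.tsv`, and the choice of the `nr` database are assumptions, and whether a single run is actually faster will depend on the setup.

```python
def to_fasta(records, width=60):
    """Format (name, sequence) pairs as multi-record FASTA text."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")
        # Wrap each sequence at a fixed line width.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


records = [
    ("seq1", "CCGCAGTAACGATATCCTTCGCCAGGCTTCCGCTTATTTTGCGAAGGCGGAGTTCGACCG"),
    ("seq2", "TCAGCAACAGCGACATCATCCGGATAAACGCAGTGCCCGTGCGCAGCGCGATGACTGGCT"
             "GAAGAAAGAGATACAGCGCGTATACGATGAAAATCACAAGG"),
]

fasta_text = to_fasta(records)
print(fasta_text, end="")

# Hypothetical invocation once fasta_text has been saved as queries.fasta.
# File names and database are placeholders; adjust them to your installation.
command = ["blastx", "-query", "queries.fasta", "-db", "nr",
           "-outfmt", "6", "-out", "hits.tsv"]
print(" ".join(command))
```

With tabular output (`-outfmt 6`), the first column is the query identifier, so the hits for seq1 and seq2 stay separate and can still be compared to check whether both point to the same organism.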
