Hello,
If I have two sequences like this:
> seq 1
CCGCAGTAACGATATCCTTCGCCAGGCTTCCGCTTATTTTGCGAAGGCGGAGTTCGACCG
> seq 2
TCAGCAACAGCGACATCATCCGGATAAACGCAGTGCCCGTGCGCAGCGCGATGACTGGCTGAAGAAAGAGATACAGCGCGTATACGATGAAAATCACAAGG
that should map to the same organism, and I would like to confirm this using BLAST, is it better to run each sequence alone or to concatenate them?
I think the second option (concatenating) is better because it increases the number of nucleotides and reduces the false hit rate, but I would like to confirm.
Also, I should not concatenate them directly, but separate them to tell the software that this is not a continuous sequence. Should I use N, -, or something else as the separator, and how many separator characters should I add?
For instance, if I use
CCGCAGTAACGATATCCTTCGCCAGGCTTCCGCTTATTTTGCGAAGGCGGAGTTCGACCGNNNNNNNNNNTCAGCAACAGCGACATCATCCGGATAAACGCAGTGCCCGTGCGCAGCGCGATGACTGGCTGAAGAAAGAGATACAGCGCGTATACGATGAAAATCACAAGG
on BlastX, the first hit I get is integrase [Escherichia coli O45:H2 str. 2009C-3686]
but if I use
CCGCAGTAACGATATCCTTCGCCAGGCTTCCGCTTATTTTGCGAAGGCGGAGTTCGACCG----------CAGCAACAGCGACATCATCCGGATAAACGCAGTGCCCGTGCGCAGCGCGATGACTGGCTGAAGAAAGAGATACAGCGCGTATACGATGAAAATCACAAGG
I get IS3 family transposase [Escherichia coli]. A similar kind of protein, but not the same one...
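For reference, here is a minimal sketch of how the spacer-joined query above could be built. The helper names `join_with_spacer` and `relative_frame` are made up for illustration and are not part of BLAST or any library. The second helper only does the arithmetic for where the second sequence starts relative to the reading frame of the first, which may matter for a translated search like BlastX, since a spacer whose length is not a multiple of 3 shifts that frame.

```python
def join_with_spacer(seqs, spacer_char="N", spacer_len=10):
    """Join nucleotide sequences with a run of spacer characters between them."""
    spacer = spacer_char * spacer_len
    return spacer.join(s.strip().upper() for s in seqs)


def relative_frame(first_len, spacer_len):
    """Frame offset (0, 1 or 2) of the second sequence relative to the first."""
    return (first_len + spacer_len) % 3


seq1 = "CCGCAGTAACGATATCCTTCGCCAGGCTTCCGCTTATTTTGCGAAGGCGGAGTTCGACCG"
seq2 = ("TCAGCAACAGCGACATCATCCGGATAAACGCAGTGCCCGTGCGCAGCGCGATGACTGGCT"
        "GAAGAAAGAGATACAGCGCGTATACGATGAAAATCACAAGG")

# Same layout as the N-separated example: seq 1, ten Ns, seq 2.
query = join_with_spacer([seq1, seq2], spacer_char="N", spacer_len=10)
print(query)

# With a 10-character spacer the second sequence is not in frame with the first.
print(relative_frame(len(seq1), 10))
```

Changing `spacer_len` to a multiple of 3 keeps the two pieces in the same relative frame, although whether that is desirable depends on the real frame of each piece, which is not known here.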
It sounds like you've made up your mind to use the sequences together somehow, and you want us to tell you you're right.
No, I would like to know whether this approach is viable or biased. The idea behind concatenating was to get more specific results (a slightly better e-value), but also to ease the computation: instead of running BLAST (which takes a long time) n times for n sequences, I would launch it just once with a longer sequence. That is, if it is true that this streamlined approach is better than running n smaller sequences.
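For comparison, the "launch it just once" goal can also be approached without concatenating, by putting all n sequences as separate records in one FASTA file and passing that file as the query. This is only a sketch: it assumes the BLAST program in use accepts a multi-record FASTA query (NCBI BLAST+ does), and the function name `to_multifasta` and the record names are hypothetical.

```python
def to_multifasta(records, width=60):
    """Format (name, sequence) pairs as multi-record FASTA text."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")
        # Wrap each sequence at a fixed line width.
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n"


records = [
    ("seq_1", "CCGCAGTAACGATATCCTTCGCCAGGCTTCCGCTTATTTTGCGAAGGCGGAGTTCGACCG"),
    ("seq_2", "TCAGCAACAGCGACATCATCCGGATAAACGCAGTGCCCGTGCGCAGCGCGATGACTGGCT"
              "GAAGAAAGAGATACAGCGCGTATACGATGAAAATCACAAGG"),
]

# Each record keeps its own header, so hits are reported per sequence.
print(to_multifasta(records), end="")
```

With this layout each sequence is still searched on its own, so it does not give the longer-query effect on the e-value that concatenation is meant to provide.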