I use blast+ package. After running blastn
I get a sequence range that I forward then to blastdbcmd
. But when command spot coordinates on minus strand it complains:
blastdbcmd -db ./database/database.fna \
-entry CM008969.1 \
-strand minus \
-range 659-392
BLAST engine error: Failed to parse sequence range (start cannot be larger than stop)
Is it necessary to "flipped" the coordinates in this case? Or it is only mistake in syntax - (hopefully, as there is -strand
parameter, that I guess should deal with the problem)?
If the "flip" is needed how to do it? Would I need to count the length of the genome and then subtract differences in distances to both ends of the strand to the respect of the opposite strand?
-------------------------------------------------------- EDIT -----------------------------------------------------------------------------------------------
query:
>U6
TTGTGCTTCGGCACAACTGTTAAAATTGGAAAAATACAGAGAAGATTAGCATGGCCCCTGCGCAAGGATGACACGCATAAATCGAGAAATGGTTCCAAATTTTT
database made from genome: https://www.ncbi.nlm.nih.gov/assembly/GCA_000002595.3
when using blastn result looks like:
# Fields: query id, subject id, % identity, % query coverage per subject, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 11 hits found
U6 CM008969.1 100.00 100 104 0 0 1 104 3788683 3788580 6e-49 193
U6 CM008969.1 99.02 100 102 1 0 3 104 3473456 3473557 4e-46 183
U6 CM008969.1 96.04 100 101 4 0 1 101 2474443 2474543 1e-40 165
U6 CM008969.1 98.53 100 68 1 0 37 104 3476470 3476403 3e-27 121
U6 CM008969.1 98.53 100 68 1 0 37 104 3476629 3476562 3e-27 121
U6 CM008969.1 98.53 100 68 1 0 37 104 3761506 3761573 3e-27 121
U6 CM008969.1 98.53 100 68 1 0 37 104 3761687 3761754 3e-27 121
U6 CM008969.1 98.53 100 68 1 0 37 104 3761869 3761936 3e-27 121
U6 CM008969.1 98.53 100 68 1 0 37 104 3771731 3771664 3e-27 121
U6 CM008969.1 98.53 100 68 1 0 37 104 3772741 3772674 3e-27 121
U6 CM008969.1 97.06 100 68 2 0 37 104 3772559 3772492 1e-25 115
and now, using 4th record from above (as it is the first hit on "minus" strand) in blastdbcmd
as follow:
blastdbcmd -db ./database/database.fna \
-entry CM008969.1 \
-strand minus \
-range 3476403-3476470 \ #flipped coordinates
-outfmt %s
gives:
CAGAGAAGATTAGCATGGCCCCTGCGCAAGGATGACACGCACAAATCGAGAAATGGTTCCAAATTTTT
and finally, those sequence as test
serves as a new query against the same database gives results:
# Fields: query id, subject id, % identity, % query coverage per subject, alignment length, mismatches, gap opens,
q. start, q. end, s. start, s. end, evalue, bit score
# 12 hits found
test CM008969.1 100.00 100 68 0 0 1 68 3473490 3473557 4e-29 126
test CM008969.1 100.00 100 68 0 0 1 68 3476470 3476403 4e-29 126
test CM008969.1 100.00 100 68 0 0 1 68 3476629 3476562 4e-29 126
test CM008969.1 100.00 100 68 0 0 1 68 3761506 3761573 4e-29 126
test CM008969.1 100.00 100 68 0 0 1 68 3761687 3761754 4e-29 126
test CM008969.1 100.00 100 68 0 0 1 68 3761869 3761936 4e-29 126
test CM008969.1 100.00 100 68 0 0 1 68 3771731 3771664 4e-29 126
test CM008969.1 100.00 100 68 0 0 1 68 3772741 3772674 4e-29 126
test CM008969.1 98.53 100 68 1 0 1 68 3772559 3772492 2e-27 121
test CM008969.1 98.53 100 68 1 0 1 68 3788647 3788580 2e-27 121
test CM008969.1 96.92 100 65 2 0 1 65 2474479 2474543 4e-24 110
test CM008969.1 97.37 100 38 1 0 31 68 3773539 3773502 8e-1165.8
At the end, I would expect the same stats results, but they are not the same:
U6 CM008969.1 98.53 100 68 1 0 37 104 3476470 3476403 3e-27 121
vs
test CM008969.1 100.00 100 68 0 0 1 68 3473490 3473557 4e-29 126
Why not try to flip the numbers and give it a try? Strand specification should be adequate to get the correct range. You may need to reverse complement the result if the query returns top strand sequence.
When I first blasted my original query with the reference genome I got 100% identity and query coverage on "minus" strand.
But when I took coordinates from this hit and put them to
blastdbcmd
with flipped coordinates and-strand minus
I got 2 mismatches, identity - 97% and query coverage 98%.If it would be only about flipping the coordinates then results should be the same, right?
With
blastdbcmd
you are just recovering sequence from an entry that is in your database. I am not sure what you are referring to in terms of mismatches/identity.Exactly. My "recovered" sequence is different from original - this what I wanted to show by the identity and coverage query.
I will have to look at that but it does not seem logical why that should happen. Can you post examples of the alignments? What does
original
mean here?Please see my edit - I posted the example there.
If you search with a sequence that is already present in the database you are bound to get 100% hit to entire sequence which is what you get in your
test
. Since there are 8 hits with 100% identity, they are listed based on starting nucleotide (sorted) in hit.Second hit in your
test
results:perfectly matches your original hit co-ordinates
As for differences between
U6
andtest
(there is one base pair difference)I haven't used this package, but if you have specified strand is "minus", why do you specify the range as "end to start"?
. <- range ->
A T T C C T C C C A G +
T A A G G A G G G T C - (strand)
Like this, the range just determines which part of double-stranded info will be used, and strand determines which strand will be used
This is how the previously used tool
blastn
from the same package is giving the coordinates of an alignment. I suppose they should work together, I mean the format should be unified within one package.