UPDATE (2018/10/12) old contents are removed to make the point of this entry much clearer:
I made fasta files which reproduce the problem written in https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty833/5106166 & https://gist.github.com/sujaikumar/504b3b7024eaf3a04ef5 .
please use
http://miscsources.sakura.ne.jp/blast_2018/target.fas
as query. The sequence is nHd.2.3.1.t00019-RA which is mentioned in the 2015 gist post https://gist.github.com/sujaikumar/504b3b7024eaf3a04ef5 .
Please use
http://miscsources.sakura.ne.jp/blast_2018/dummydb.fas
http://miscsources.sakura.ne.jp/blast_2018/dummydb51.fas
http://miscsources.sakura.ne.jp/blast_2018/dummydb_mod.fas
as database.
I used makeblastdb and blastp command in blast+2.7.1 for windows 64 bit.
The databases contain 2 (KRX89030, WP_042303394 ) sequences which are family of sequences mentioned in
https://gist.github.com/sujaikumar/504b3b7024eaf3a04ef5
However,
dummydb.fas contains 52 copies of KRX89030.1 .
dummydb51.fas contains 51 copies of KRX89030.1 .
and both have one copy of WP_042303394.1 .
dummydb_mod.fas contains 52 copies of KRX9030.1 and difference with dummydb.fas will be explained later.
Why 52 copies are needed is because BLAST decides number of hit sequences to retain (after the firs step? not sure) as follows
blast_hits.c
prelim_hitlist_size = hit_options->hitlist_size;
if (ext_options->compositionBasedStats)
prelim_hitlist_size = prelim_hitlist_size * 2 + 50;
else if (scoring_options->gapped_calculation)
prelim_hitlist_size = MIN(2 * prelim_hitlist_size,
prelim_hitlist_size + 50);
(*retval)->prelim_hitlist_size = MAX(prelim_hitlist_size, 10);
Using dummydb.fas, the problem is reproducible.
C:\misc\gistup>C:\misc\ncbi-blast-2.7.1+\bin\blastp.exe -query target.fas -db dummydb.fas -outfmt 6 -max_target_seqs 2
nHd.2.3.1.t00019-RA WP_042303394.1 58.678 121 49 1 4 124 93 212 7.99e-047 153
nHd.2.3.1.t00019-RA seq52 63.115 122 45 0 1 122 105 226 3.66e-042 140
C:\misc\gistup>C:\misc\ncbi-blast-2.7.1+\bin\blastp.exe -query target.fas -db dummydb.fas -outfmt 6 -max_target_seqs 1
nHd.2.3.1.t00019-RA seq52 63.115 122 45 0 1 122 105 226 3.66e-042 140
yey.
Using dummydb51.fas, the problem does not occur.
C:\misc\gistup>C:\misc\ncbi-blast-2.7.1+\bin\blastp.exe -query target.fas -db dummydb51.fas -outfmt 6 -max_target_seqs 2
nHd.2.3.1.t00019-RA WP_042303394.1 58.678 121 49 1 4 124 93 212 7.84e-047 153
nHd.2.3.1.t00019-RA seq51 63.115 122 45 0 1 122 105 226 3.59e-042 140
C:\misc\gistup>C:\misc\ncbi-blast-2.7.1+\bin\blastp.exe -query target.fas -db dummydb51.fas -outfmt 6 -max_target_seqs 1
nHd.2.3.1.t00019-RA WP_042303394.1 58.678 121 49 1 4 124 93 212 7.84e-047 153
If the problem is caused by the ungapped alignment step as NCBI staff mentioned (update 2018/11/03 according to the https://www.ncbi.nlm.nih.gov/books/NBK279684/#_appendices_Outline_of_the_BLAST_process_ it performs "a gapped extension based on the gap free extension") , when the sequence showed higher score with ungapped alignment, the problem will not occur.
Apparently, WP_042303394.1 has lower score than KRX89030.1 with ungapped mode.
C:\misc\gistup>C:\misc\ncbi-blast-2.7.1+\bin\blastp.exe -query target.fas -db dummydb.fas -outfmt 6 -max_target_seqs 53 -comp_based_stats F -ungapped
...omitted...
nHd.2.3.1.t00019-RA seq3 63.115 122 45 0 1 122 105 226 1.32e-052 194
nHd.2.3.1.t00019-RA seq2 63.115 122 45 0 1 122 105 226 1.32e-052 194
nHd.2.3.1.t00019-RA seq1 63.115 122 45 0 1 122 105 226 1.32e-052 194
nHd.2.3.1.t00019-RA WP_042303394.1 60.000 105 42 0 4 108 93 197 8.05e-042 158
nHd.2.3.1.t00019-RA WP_042303394.1 47.368 19 10 0 106 124 194 212 0.020 27.6
Then if I modify WP_042303394.1 to have higher score with ungapped mode, the problem does not occur. dummy_mod.fas contains modified WP_042303394.1, which was named A_WP_042303394.1
C:\misc\gistup>C:\misc\ncbi-blast-2.7.1+\bin\blastp.exe -query target.fas -db dummydb_mod.fas -outfmt 6 -max_target_seqs 2 -comp_based_stats F -ungapped
nHd.2.3.1.t00019-RA A_WP_042303394.1 63.636 121 44 0 4 124 93 213 3.99e-054 199
nHd.2.3.1.t00019-RA seq52 63.115 122 45 0 1 122 105 226 1.33e-052 194
C:\misc\gistup>C:\misc\ncbi-blast-2.7.1+\bin\blastp.exe -query target.fas -db dummydb_mod.fas -outfmt 6 -max_target_seqs 2
nHd.2.3.1.t00019-RA A_WP_042303394.1 63.636 121 44 0 4 124 93 213 1.45e-053 172
nHd.2.3.1.t00019-RA seq52 63.115 122 45 0 1 122 105 226 3.66e-042 140
C:\misc\gistup>C:\misc\ncbi-blast-2.7.1+\bin\blastp.exe -query target.fas -db dummydb_mod.fas -outfmt 6 -max_target_seqs 1
nHd.2.3.1.t00019-RA A_WP_042303394.1 63.636 121 44 0 4 124 93 213 1.45e-053 172
yey!
With these results, I don't think
BLAST returns the first N hits that exceed the specified E-value threshold, which may or may not be the highest scoring N hits.
Please let me know if you have any problems to reproduce my result. (Scoring system of BLAST is very complicated thus the procedure of the last part (etimating behavior using scores in ungapped mode) may not work with other sequences.)
the paper states that the order is set by the order in the database, hence if you have the "right" order relative to the search the parameter will have no effect.
To verify what the paper states take the best hit that you have, move it last in the FASTA file, reindex with blast and test again
I am kind of surprised that the note was accepted without details of what exactly was done, how it was done and what it was done with.
I just listened to a blast webinar at NCBI and it sounded like this was not an issue (at least in the example that was shown of a search with 1 max_target_seq and 10. Recommendation was not to use just 1 for any searches (use at least 10 and then filter).
IMO that recommendation is still kind of suspicious - it either works as we think it does or not - this sort of special cases are even worse (if it worked one way in some cases and the other way in some others)
that being said I cannot reproduce the behavior either:
the script prints:
so I get the same result when limiting to 1
I get the same result with blast v. 2.7.1 on linux.
Taking a random ID from same database:
now that I read the report a little better - I want to mention that my example here may not trigger the bug. As the response from NCBI reply states
it just might be that my example does not hit this scenario. Which only makes the behavior more confusing IMO, it seems to work until it does not.
Ohohoh you are right, you are right. & NCBI staff is right. please use query
which is mentioned in the 2015 gist post Perform blastp search against nr using max_target_seqs 100 and 10000 (web blast is fine).
You can see apparent difference between the two.
check the Gap numbers of the entries which are only seen in max_target_seqs 10000.
Yeeeeeeey I can sleep well good night.
fishgolden : I have done the
blastp
search againstnr
with your protein sequence with three limits 5, 100, 500. Top two hits are the same (with some differences later)nr is updated so there are much more eukaryotic genes now compared with 2015. 500 is not enough I think.
For 5, 100, 500, 1000, 5000, 10000. Searches done locally.
Sorry, I misunderstood what you said. The third or forth hits of 500 are the ones (or family of ones) which the gist author mentioned (those are from bacteria (Paraburkholderia kururiensis)). More similar eukaryotic sequences were added after the author's post.
The statistics are so close on later hits that their order may be re-arranged a bit in the display. Top two hits are clearly different. Once we get into
1e-39
and1e-40
territory those are very similar.I think the ones who address this problem as "bug" do not care about e-value differences.
They just mention the sequences exist or not.
Please perform text search for "UniRef50_A7E2S9" and "UniRef50_A0A0D9RFQ1" in this page or gist.
If the order in the database decides the order in the results, the hits appeared with "-max_target_seq 1" for ”testfas1.fas” and "testfas2.fas" must be the same.... I think.
@Istvan: I tried moving the fasta sequence around in a file, re-indexing the file. The result does not change. So
blastn
is not affected. Let me see if I can test withblastp
.now that I looked at it in more detail there is more to it, the "bug" or "behavior" is triggered only in some special cases.
You have to have an alignment that matches the statement above.
Look at the examples that Brad Chapman has on the github page and search against NR so that there is ample room for the behavior to trigger.
C: Misunderstood parameter of NCBI BLAST (Response from NCBI in NAR posted by @Vijay)
Thank you for the notification! I was very surprized that there was a bug. It made problems complicated........
Just to be certain, you are using exactly the same versions of blast/database etc as referred in the paper?
I couldn't find the version number in the paper. In fact, it's a letter made from 2 pages. There are not example commands and queries and databases which result in the problem. Thus I wrote "problem" not "results".
The other researcher uses Linux & 2.2.28~31 I'll check this tomorrow. 10/5 : I updated main text & uploaded the result file. https://gist.github.com/sujaikumar/504b3b7024eaf3a04ef5
fishgolden : We can't see the screenshots.
me too. it is the post in 3 yeas ago.
I am referring to your screenshot: https://www.dropbox.com/s/8z1kso2d53k3l5k/max500.png?dl=0
It come from the preview of the link of gist I pasted. Biostars changes it automatically. Sorry it is confusing. I changed.
genomax Thank you.
Recommendation: show the alignments as text and not images
also remove the unnecessary data listings, those make the post difficult to follow
Thank you for the suggestion. I'll do it when I have a time. (Unfortunately, I'm a little bit busy for the next several days.)