(solved) I couldn't reproduce the problem of max_target_seqs
3
4
Entering edit mode
6.3 years ago
fishgolden ▴ 520

UPDATE (2018/10/12) old contents are removed to make the point of this entry much clearer:

I made fasta files which reproduce the problem written in https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty833/5106166 & https://gist.github.com/sujaikumar/504b3b7024eaf3a04ef5 .

please use

http://miscsources.sakura.ne.jp/blast_2018/target.fas

as query. The sequence is nHd.2.3.1.t00019-RA which is mentioned in the 2015 gist post https://gist.github.com/sujaikumar/504b3b7024eaf3a04ef5 .

Please use

http://miscsources.sakura.ne.jp/blast_2018/dummydb.fas

http://miscsources.sakura.ne.jp/blast_2018/dummydb51.fas

http://miscsources.sakura.ne.jp/blast_2018/dummydb_mod.fas

as database.

I used makeblastdb and blastp command in blast+2.7.1 for windows 64 bit.

The databases contain 2 (KRX89030, WP_042303394 ) sequences which are family of sequences mentioned in

https://gist.github.com/sujaikumar/504b3b7024eaf3a04ef5

However,

dummydb.fas contains 52 copies of KRX89030.1 .

dummydb51.fas contains 51 copies of KRX89030.1 .

and both have one copy of WP_042303394.1 .

dummydb_mod.fas contains 52 copies of KRX9030.1 and difference with dummydb.fas will be explained later.

Why 52 copies are needed is because BLAST decides number of hit sequences to retain (after the firs step? not sure) as follows

blast_hits.c

   prelim_hitlist_size = hit_options->hitlist_size;
   if (ext_options->compositionBasedStats)
        prelim_hitlist_size = prelim_hitlist_size * 2 + 50;
   else if (scoring_options->gapped_calculation)
        prelim_hitlist_size = MIN(2 * prelim_hitlist_size,
                                  prelim_hitlist_size + 50);

   (*retval)->prelim_hitlist_size = MAX(prelim_hitlist_size, 10);

Using dummydb.fas, the problem is reproducible.

C:\misc\gistup>C:\misc\ncbi-blast-2.7.1+\bin\blastp.exe -query target.fas -db dummydb.fas -outfmt 6 -max_target_seqs 2
nHd.2.3.1.t00019-RA     WP_042303394.1  58.678  121     49      1       4       124     93      212     7.99e-047       153
nHd.2.3.1.t00019-RA     seq52   63.115  122     45      0       1       122     105     226     3.66e-042       140

C:\misc\gistup>C:\misc\ncbi-blast-2.7.1+\bin\blastp.exe -query target.fas -db dummydb.fas -outfmt 6 -max_target_seqs 1
nHd.2.3.1.t00019-RA     seq52   63.115  122     45      0       1       122     105     226     3.66e-042       140

yey.

Using dummydb51.fas, the problem does not occur.

C:\misc\gistup>C:\misc\ncbi-blast-2.7.1+\bin\blastp.exe -query target.fas -db dummydb51.fas -outfmt 6 -max_target_seqs 2
nHd.2.3.1.t00019-RA     WP_042303394.1  58.678  121     49      1       4       124     93      212     7.84e-047       153
nHd.2.3.1.t00019-RA     seq51   63.115  122     45      0       1       122     105     226     3.59e-042       140

C:\misc\gistup>C:\misc\ncbi-blast-2.7.1+\bin\blastp.exe -query target.fas -db dummydb51.fas -outfmt 6 -max_target_seqs 1
nHd.2.3.1.t00019-RA     WP_042303394.1  58.678  121     49      1       4       124     93      212     7.84e-047       153

If the problem is caused by the ungapped alignment step as NCBI staff mentioned (update 2018/11/03 according to the https://www.ncbi.nlm.nih.gov/books/NBK279684/#_appendices_Outline_of_the_BLAST_process_ it performs "a gapped extension based on the gap free extension") , when the sequence showed higher score with ungapped alignment, the problem will not occur.

Apparently, WP_042303394.1 has lower score than KRX89030.1 with ungapped mode.

C:\misc\gistup>C:\misc\ncbi-blast-2.7.1+\bin\blastp.exe -query target.fas -db dummydb.fas -outfmt 6 -max_target_seqs 53 -comp_based_stats F -ungapped
...omitted...
nHd.2.3.1.t00019-RA     seq3    63.115  122     45      0       1       122     105     226     1.32e-052       194
nHd.2.3.1.t00019-RA     seq2    63.115  122     45      0       1       122     105     226     1.32e-052       194
nHd.2.3.1.t00019-RA     seq1    63.115  122     45      0       1       122     105     226     1.32e-052       194
nHd.2.3.1.t00019-RA     WP_042303394.1  60.000  105     42      0       4       108     93      197     8.05e-042       158
nHd.2.3.1.t00019-RA     WP_042303394.1  47.368  19      10      0       106     124     194     212     0.020   27.6

Then if I modify WP_042303394.1 to have higher score with ungapped mode, the problem does not occur. dummy_mod.fas contains modified WP_042303394.1, which was named A_WP_042303394.1

C:\misc\gistup>C:\misc\ncbi-blast-2.7.1+\bin\blastp.exe -query target.fas -db dummydb_mod.fas -outfmt 6 -max_target_seqs 2 -comp_based_stats F -ungapped
nHd.2.3.1.t00019-RA     A_WP_042303394.1        63.636  121     44      0       4       124     93      213     3.99e-054       199
nHd.2.3.1.t00019-RA     seq52   63.115  122     45      0       1       122     105     226     1.33e-052       194

C:\misc\gistup>C:\misc\ncbi-blast-2.7.1+\bin\blastp.exe -query target.fas -db dummydb_mod.fas -outfmt 6 -max_target_seqs 2
nHd.2.3.1.t00019-RA     A_WP_042303394.1        63.636  121     44      0       4       124     93      213     1.45e-053       172
nHd.2.3.1.t00019-RA     seq52   63.115  122     45      0       1       122     105     226     3.66e-042       140

 C:\misc\gistup>C:\misc\ncbi-blast-2.7.1+\bin\blastp.exe -query target.fas -db dummydb_mod.fas -outfmt 6 -max_target_seqs 1
 nHd.2.3.1.t00019-RA     A_WP_042303394.1        63.636  121     44      0       4       124     93      213     1.45e-053       172

yey!

With these results, I don't think

https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty833/5106166

BLAST returns the first N hits that exceed the specified E-value threshold, which may or may not be the highest scoring N hits.

Please let me know if you have any problems to reproduce my result. (Scoring system of BLAST is very complicated thus the procedure of the last part (etimating behavior using scores in ungapped mode) may not work with other sequences.)

blast • 6.3k views
ADD COMMENT
1
Entering edit mode

the paper states that the order is set by the order in the database, hence if you have the "right" order relative to the search the parameter will have no effect.

To verify what the paper states take the best hit that you have, move it last in the FASTA file, reindex with blast and test again

ADD REPLY
0
Entering edit mode

I am kind of surprised that the note was accepted without details of what exactly was done, how it was done and what it was done with.

I just listened to a blast webinar at NCBI and it sounded like this was not an issue (at least in the example that was shown of a search with 1 max_target_seq and 10. Recommendation was not to use just 1 for any searches (use at least 10 and then filter).

ADD REPLY
0
Entering edit mode

IMO that recommendation is still kind of suspicious - it either works as we think it does or not - this sort of special cases are even worse (if it worked one way in some cases and the other way in some others)

that being said I cannot reproduce the behavior either:

# Get the 16S Microbial data
update_blastdb.pl --decompress 16SMicrobial

# Take the last entry in the database
blastdbcmd -db 16SMicrobial -entry all -outfmt '%a' | tail -1

# The last entry in the database is NR_158155.1

# Get the sequence for the last entry
blastdbcmd -db 16SMicrobial -entry 'NR_158155.1' > s.fa

# Get the best hit when searching for all.
blastn -db 16SMicrobial -query s.fa -outfmt 6 | head -1

# Get the best hit when limiting for 1 sequence.
blastn -db 16SMicrobial -query s.fa -outfmt 6 -max_target_seqs 1 | head -1

the script prints:

NR_158155.1
NR_158155.1 NR_158155.1 100.000 1409    0   0   1   1409    1   1409    0.0 2603
NR_158155.1 NR_158155.1 100.000 1409    0   0   1   1409    1   1409    0.0 2603

so I get the same result when limiting to 1

ADD REPLY
0
Entering edit mode

I get the same result with blast v. 2.7.1 on linux.

Taking a random ID from same database:

$ blastn -db 16SMicrobial -query s.fa -outfmt 6 -max_target_seqs 10 | head -5
NR_158067.1 NR_158067.1 100.000 1473    0   0   1   1473    1   1473    0.0 2721
NR_158067.1 NR_113517.1 94.508  1475    76  5   1   1473    1   1472    0.0 2270
NR_158067.1 NR_113516.1 94.441  1475    77  5   1   1473    1   1472    0.0 2265
NR_158067.1 NR_102445.1 94.456  1461    76  5   15  1473    1   1458    0.0 2244
NR_158067.1 NR_136772.1 94.533  1445    74  5   2   1445    1   1441    0.0 2226

$ blastn -db 16SMicrobial -query s.fa -outfmt 6 -max_target_seqs 1 
NR_158067.1 NR_158067.1 100.000 1473    0   0   1   1473    1   1473    0.0 2721
ADD REPLY
0
Entering edit mode

now that I read the report a little better - I want to mention that my example here may not trigger the bug. As the response from NCBI reply states

In some cases a final HSP will improve enough in the later gapped phase to rise to the top hits.

it just might be that my example does not hit this scenario. Which only makes the behavior more confusing IMO, it seems to work until it does not.

ADD REPLY
0
Entering edit mode

Ohohoh you are right, you are right. & NCBI staff is right. please use query

>nHd.2.3.1.t00019-RA
MSNLGITDPCVDAMNSLGLKLEELQDLEVDAGLGNGGLGRLAACFMDSLATLSIPAIGYGIRYEFGIFNQRVINGEQVEE
RDDWLEFGDPWEKLRQDKKISVYFNGKTYVDKEGRSHWVDTQQIVD

which is mentioned in the 2015 gist post Perform blastp search against nr using max_target_seqs 100 and 10000 (web blast is fine).

You can see apparent difference between the two.

check the Gap numbers of the entries which are only seen in max_target_seqs 10000.

Yeeeeeeey I can sleep well good night.

ADD REPLY
0
Entering edit mode

fishgolden : I have done the blastp search against nr with your protein sequence with three limits 5, 100, 500. Top two hits are the same (with some differences later)

==> out.1e-5.max100.txt <==
nHd.2.3.1.t00019-RA OQV20892.1  100.000 122 0   0   1   122 1   122 5.13e-85    253
nHd.2.3.1.t00019-RA GAV08169.1  75.806  124 30  0   1   124 51  174 5.96e-45    162
nHd.2.3.1.t00019-RA KRX89027.1  63.115  122 45  0   1   122 105 226 7.55e-37    140
nHd.2.3.1.t00019-RA KRX89025.1  63.115  122 45  0   1   122 105 226 1.54e-36    140
nHd.2.3.1.t00019-RA KFD69381.1  61.983  121 46  0   1   121 466 586 2.00e-36    141

==> out.1e-5.max500.txt <==
nHd.2.3.1.t00019-RA OQV20892.1  100.000 122 0   0   1   122 1   122 5.13e-85    253
nHd.2.3.1.t00019-RA GAV08169.1  75.806  124 30  0   1   124 51  174 5.96e-45    162
nHd.2.3.1.t00019-RA WP_042303394.1  58.678  121 49  1   4   124 93  212 1.08e-40    153
nHd.2.3.1.t00019-RA WP_017775351.1  58.678  121 49  1   4   124 93  212 1.22e-40    153
nHd.2.3.1.t00019-RA KRX89027.1  63.115  122 45  0   1   122 105 226 7.55e-37    140

==> out.1e-5.max5.txt <==
nHd.2.3.1.t00019-RA OQV20892.1  100.000 122 0   0   1   122 1   122 5.13e-85    253
nHd.2.3.1.t00019-RA GAV08169.1  75.806  124 30  0   1   124 51  174 5.96e-45    162
nHd.2.3.1.t00019-RA KRX89027.1  63.115  122 45  0   1   122 105 226 7.55e-37    140
nHd.2.3.1.t00019-RA KRX89025.1  63.115  122 45  0   1   122 105 226 1.54e-36    140
nHd.2.3.1.t00019-RA KFD69381.1  61.983  121 46  0   1   121 466 586 2.00e-36    141
ADD REPLY
0
Entering edit mode

nr is updated so there are much more eukaryotic genes now compared with 2015. 500 is not enough I think.

ADD REPLY
0
Entering edit mode

For 5, 100, 500, 1000, 5000, 10000. Searches done locally.

==> out.1e-5.max5.txt <==
nHd.2.3.1.t00019-RA OQV20892.1  100.000 122 0   0   1   122 1   122 5.13e-85    253
nHd.2.3.1.t00019-RA GAV08169.1  75.806  124 30  0   1   124 51  174 5.96e-45    162
nHd.2.3.1.t00019-RA KRX89027.1  63.115  122 45  0   1   122 105 226 7.55e-37    140
nHd.2.3.1.t00019-RA KRX89025.1  63.115  122 45  0   1   122 105 226 1.54e-36    140
nHd.2.3.1.t00019-RA KFD69381.1  61.983  121 46  0   1   121 466 586 2.00e-36    141    

==> out.1e-5.max100.txt <==
nHd.2.3.1.t00019-RA OQV20892.1  100.000 122 0   0   1   122 1   122 5.13e-85    253
nHd.2.3.1.t00019-RA GAV08169.1  75.806  124 30  0   1   124 51  174 5.96e-45    162
nHd.2.3.1.t00019-RA KRX89027.1  63.115  122 45  0   1   122 105 226 7.55e-37    140
nHd.2.3.1.t00019-RA KRX89025.1  63.115  122 45  0   1   122 105 226 1.54e-36    140
nHd.2.3.1.t00019-RA KFD69381.1  61.983  121 46  0   1   121 466 586 2.00e-36    141

==> out.1e-5.max500.txt <==
nHd.2.3.1.t00019-RA OQV20892.1  100.000 122 0   0   1   122 1   122 5.13e-85    253
nHd.2.3.1.t00019-RA GAV08169.1  75.806  124 30  0   1   124 51  174 5.96e-45    162
nHd.2.3.1.t00019-RA WP_042303394.1  58.678  121 49  1   4   124 93  212 1.08e-40    153
nHd.2.3.1.t00019-RA WP_017775351.1  58.678  121 49  1   4   124 93  212 1.22e-40    153
nHd.2.3.1.t00019-RA KRX89027.1  63.115  122 45  0   1   122 105 226 7.55e-37    140

==> out.1e-5.max1000.txt <==
nHd.2.3.1.t00019-RA OQV20892.1  100.000 122 0   0   1   122 1   122 5.13e-85    253
nHd.2.3.1.t00019-RA GAV08169.1  75.806  124 30  0   1   124 51  174 5.96e-45    162
nHd.2.3.1.t00019-RA WP_042303394.1  58.678  121 49  1   4   124 93  212 1.08e-40    153
nHd.2.3.1.t00019-RA WP_017775351.1  58.678  121 49  1   4   124 93  212 1.22e-40    153
nHd.2.3.1.t00019-RA WP_114812007.1  57.851  121 50  1   4   124 93  212 1.76e-39    150

==> out.1e-5.max5000.txt <==
nHd.2.3.1.t00019-RA OQV20892.1  100.000 122 0   0   1   122 1   122 5.13e-85    253
nHd.2.3.1.t00019-RA GAV08169.1  75.806  124 30  0   1   124 51  174 5.96e-45    162
nHd.2.3.1.t00019-RA WP_042303394.1  58.678  121 49  1   4   124 93  212 1.08e-40    153
nHd.2.3.1.t00019-RA WP_017775351.1  58.678  121 49  1   4   124 93  212 1.22e-40    153
nHd.2.3.1.t00019-RA WP_114812007.1  57.851  121 50  1   4   124 93  212 1.76e-39    150

==> out.1e-5.max10000.txt <==
nHd.2.3.1.t00019-RA OQV20892.1  100.000 122 0   0   1   122 1   122 5.13e-85    253
nHd.2.3.1.t00019-RA GAV08169.1  75.806  124 30  0   1   124 51  174 5.96e-45    162
nHd.2.3.1.t00019-RA WP_042303394.1  58.678  121 49  1   4   124 93  212 1.08e-40    153
nHd.2.3.1.t00019-RA WP_017775351.1  58.678  121 49  1   4   124 93  212 1.22e-40    153
nHd.2.3.1.t00019-RA WP_114812007.1  57.851  121 50  1   4   124 93  212 1.76e-39    150
ADD REPLY
0
Entering edit mode

Sorry, I misunderstood what you said. The third or forth hits of 500 are the ones (or family of ones) which the gist author mentioned (those are from bacteria (Paraburkholderia kururiensis)). More similar eukaryotic sequences were added after the author's post.

ADD REPLY
0
Entering edit mode

The statistics are so close on later hits that their order may be re-arranged a bit in the display. Top two hits are clearly different. Once we get into 1e-39 and 1e-40 territory those are very similar.

ADD REPLY
0
Entering edit mode

I think the ones who address this problem as "bug" do not care about e-value differences.

They just mention the sequences exist or not.

ADD REPLY
0
Entering edit mode

Please perform text search for "UniRef50_A7E2S9" and "UniRef50_A0A0D9RFQ1" in this page or gist.

If the order in the database decides the order in the results, the hits appeared with "-max_target_seq 1" for ”testfas1.fas” and "testfas2.fas" must be the same.... I think.

ADD REPLY
0
Entering edit mode

@Istvan: I tried moving the fasta sequence around in a file, re-indexing the file. The result does not change. So blastn is not affected. Let me see if I can test with blastp.

Query file (truncated for brevity):
$ head qur.txt
>NR_158114.1 Mycoplasma tullyi strain 56A97 16S ribosomal RNA, partial sequence
GAGAGTTTGGATCCTGGCTCAGGATTAACGCTGGCGGCATGCCTAATACATGCAAGTCGATCGGATGTAGCAATACATTA
GAGGCGAACGGGTGAGTAACACGTATCCAATCTGCCTTATAGTGGGGGATAACTAGTCGAAAGATTAGCTAATACCGCAT
AACAAGTTACCTATCGCATGAGAATAACTTTAAAGGAGCAACTGCTTCGCTATAAGATGAGGGTGCGGCATATCAGCTAG
Version 1
$ grep ">" s2.fa
>NR_158113.1 Anaerotignum aminivorans strain SH021 16S ribosomal RNA, partial sequence
>**NR_158114.1** Mycoplasma tullyi strain 56A97 16S ribosomal RNA, partial sequence
>NR_158115.1 Borrelia lanei strain CA28-91 16S ribosomal RNA, partial sequence
>NR_158116.1 Marispirochaeta aestuarii strain JC444 16S ribosomal RNA, partial sequence
>NR_158117.1 Luteolibacter gellanilyticus strain CB-286403 16S ribosomal RNA, partial sequence
>NR_158118.1 Petrothermobacter organivorans strain ANA 16S ribosomal RNA, partial sequence
>NR_158119.1 Edaphobacter acidisoli strain 4G-K17 16S ribosomal RNA, partial sequence
>NR_158120.1 Hydrogenophaga aquatica strain CC-KL-3 16S ribosomal RNA, partial sequence
>NR_158121.1 Thalassotalea profundi strain YM155 16S ribosomal RNA, partial sequence
>NR_158122.1 Consotaella salsifontis strain USBA 369 16S ribosomal RNA, partial sequence
>NR_158123.1 Oryzisolibacter propanilivorax strain EPL6 16S ribosomal RNA, partial sequence
$ blastn -db s2.fa -query qur.txt -outfmt 6 -max_target_seqs 10
NR_158114.1 NR_158114.1 100.000 1469 0 0 1 1469 1 1469 0.0 2713
NR_158114.1 NR_158113.1 76.387 1190 219 42 232 1393 230 1385 2.28e-169 584
NR_158114.1 NR_158118.1 75.182 1233 253 34 232 1424 171 1390 8.39e-154 532
NR_158114.1 NR_158115.1 74.579 1424 261 77 14 1378 1 1382 3.02e-153 531
NR_158114.1 NR_158116.1 75.000 1192 260 30 213 1378 237 1416 8.45e-149 516
NR_158114.1 NR_158119.1 76.956 946 169 41 470 1378 407 1340 3.96e-142 494
NR_158114.1 NR_158123.1 73.717 1442 292 55 3 1394 1 1405 8.57e-139 483
NR_158114.1 NR_158120.1 74.393 1195 262 28 229 1394 224 1403 1.86e-135 472
NR_158114.1 NR_158121.1 74.331 1196 258 35 232 1393 159 1339 1.12e-132 462
NR_158114.1 NR_158117.1 73.764 1193 255 42 225 1378 165 1338 8.82e-119 416
Version 2 (query moved a couple of spots down in the file)
$ grep ">" s2_v2.fa
>NR_158113.1 Anaerotignum aminivorans strain SH021 16S ribosomal RNA, partial sequence
>NR_158115.1 Borrelia lanei strain CA28-91 16S ribosomal RNA, partial sequence
>NR_158116.1 Marispirochaeta aestuarii strain JC444 16S ribosomal RNA, partial sequence
>**NR_158114.1** Mycoplasma tullyi strain 56A97 16S ribosomal RNA, partial sequence
>NR_158117.1 Luteolibacter gellanilyticus strain CB-286403 16S ribosomal RNA, partial sequence
>NR_158118.1 Petrothermobacter organivorans strain ANA 16S ribosomal RNA, partial sequence
>NR_158119.1 Edaphobacter acidisoli strain 4G-K17 16S ribosomal RNA, partial sequence
>NR_158120.1 Hydrogenophaga aquatica strain CC-KL-3 16S ribosomal RNA, partial sequence
>NR_158121.1 Thalassotalea profundi strain YM155 16S ribosomal RNA, partial sequence
>NR_158122.1 Consotaella salsifontis strain USBA 369 16S ribosomal RNA, partial sequence
>NR_158123.1 Oryzisolibacter propanilivorax strain EPL6 16S ribosomal RNA, partial sequence
$ blastn -db s2_v2.fa -query qur.txt -outfmt 6 -max_target_seqs 10
NR_158114.1 NR_158114.1 100.000 1469 0 0 1 1469 1 1469 0.0 2713
NR_158114.1 NR_158113.1 76.387 1190 219 42 232 1393 230 1385 2.28e-169 584
NR_158114.1 NR_158118.1 75.182 1233 253 34 232 1424 171 1390 8.39e-154 532
NR_158114.1 NR_158115.1 74.579 1424 261 77 14 1378 1 1382 3.02e-153 531
NR_158114.1 NR_158116.1 75.000 1192 260 30 213 1378 237 1416 8.45e-149 516
NR_158114.1 NR_158119.1 76.956 946 169 41 470 1378 407 1340 3.96e-142 494
NR_158114.1 NR_158123.1 73.717 1442 292 55 3 1394 1 1405 8.57e-139 483
NR_158114.1 NR_158120.1 74.393 1195 262 28 229 1394 224 1403 1.86e-135 472
NR_158114.1 NR_158121.1 74.331 1196 258 35 232 1393 159 1339 1.12e-132 462
NR_158114.1 NR_158117.1 73.764 1193 255 42 225 1378 165 1338 8.82e-119 416
Version 3 (query moved to end of file)
$ grep ">" s2_end.fa
>NR_158113.1 Anaerotignum aminivorans strain SH021 16S ribosomal RNA, partial sequence
>NR_158115.1 Borrelia lanei strain CA28-91 16S ribosomal RNA, partial sequence
>NR_158116.1 Marispirochaeta aestuarii strain JC444 16S ribosomal RNA, partial sequence
>NR_158117.1 Luteolibacter gellanilyticus strain CB-286403 16S ribosomal RNA, partial sequence
>NR_158118.1 Petrothermobacter organivorans strain ANA 16S ribosomal RNA, partial sequence
>NR_158119.1 Edaphobacter acidisoli strain 4G-K17 16S ribosomal RNA, partial sequence
>NR_158120.1 Hydrogenophaga aquatica strain CC-KL-3 16S ribosomal RNA, partial sequence
>NR_158121.1 Thalassotalea profundi strain YM155 16S ribosomal RNA, partial sequence
>NR_158122.1 Consotaella salsifontis strain USBA 369 16S ribosomal RNA, partial sequence
>NR_158123.1 Oryzisolibacter propanilivorax strain EPL6 16S ribosomal RNA, partial sequence
>**NR_158114.1** Mycoplasma tullyi strain 56A97 16S ribosomal RNA, partial sequence
$ blastn -db s2_end.fa -query qur.txt -outfmt 6 -max_target_seqs 10
NR_158114.1 NR_158114.1 100.000 1469 0 0 1 1469 1 1469 0.0 2713
NR_158114.1 NR_158113.1 76.387 1190 219 42 232 1393 230 1385 2.28e-169 584
NR_158114.1 NR_158118.1 75.182 1233 253 34 232 1424 171 1390 8.39e-154 532
NR_158114.1 NR_158115.1 74.579 1424 261 77 14 1378 1 1382 3.02e-153 531
NR_158114.1 NR_158116.1 75.000 1192 260 30 213 1378 237 1416 8.45e-149 516
NR_158114.1 NR_158119.1 76.956 946 169 41 470 1378 407 1340 3.96e-142 494
NR_158114.1 NR_158123.1 73.717 1442 292 55 3 1394 1 1405 8.57e-139 483
NR_158114.1 NR_158120.1 74.393 1195 262 28 229 1394 224 1403 1.86e-135 472
NR_158114.1 NR_158121.1 74.331 1196 258 35 232 1393 159 1339 1.12e-132 462
NR_158114.1 NR_158117.1 73.764 1193 255 42 225 1378 165 1338 8.82e-119 416

ADD REPLY
0
Entering edit mode

now that I looked at it in more detail there is more to it, the "bug" or "behavior" is triggered only in some special cases.

In some cases a final HSP will improve enough in the later gapped phase to rise to the top hits.

You have to have an alignment that matches the statement above.

Look at the examples that Brad Chapman has on the github page and search against NR so that there is ample room for the behavior to trigger.

ADD REPLY
1
Entering edit mode

C: Misunderstood parameter of NCBI BLAST (Response from NCBI in NAR posted by @Vijay)

ADD REPLY
0
Entering edit mode

Thank you for the notification! I was very surprized that there was a bug. It made problems complicated........

ADD REPLY
0
Entering edit mode

Just to be certain, you are using exactly the same versions of blast/database etc as referred in the paper?

ADD REPLY
0
Entering edit mode

I couldn't find the version number in the paper. In fact, it's a letter made from 2 pages. There are not example commands and queries and databases which result in the problem. Thus I wrote "problem" not "results".

ADD REPLY
0
Entering edit mode

The other researcher uses Linux & 2.2.28~31 I'll check this tomorrow. 10/5 : I updated main text & uploaded the result file. https://gist.github.com/sujaikumar/504b3b7024eaf3a04ef5

ADD REPLY
0
Entering edit mode

fishgolden : We can't see the screenshots.

ADD REPLY
0
Entering edit mode

me too. it is the post in 3 yeas ago.

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

It come from the preview of the link of gist I pasted. Biostars changes it automatically. Sorry it is confusing. I changed.

ADD REPLY
0
Entering edit mode

genomax Thank you.

ADD REPLY
0
Entering edit mode

Recommendation: show the alignments as text and not images

also remove the unnecessary data listings, those make the post difficult to follow

ADD REPLY
0
Entering edit mode

Thank you for the suggestion. I'll do it when I have a time. (Unfortunately, I'm a little bit busy for the next several days.)

ADD REPLY
5
Entering edit mode
6.2 years ago
Peter 6.0k

It took me a while to publish this on GitHub, but I made a self-contained test case very close to the original 2015 NR example using current hits including WP_042303394.1 and KRX89030.1 but not the tardigrade sequences which were added after 2015:

I really like your approach to make test databases containing one copy of WP_042303394.1 and lots of copies of KRX89030.1.

Your database and mine both reproduce the original problem reported by Sujai in December 2015, but like you I was not able reproduce the claim from Shar et al. (2018) about the database order being important (other than in the trivial case of a tie breaker for identical scores and e-value). I've included a link to your example in the followup post:

See also this description of the BLAST algorithm and the internal limit of 2*N+50 which matches the source code you quoted, and explains why with -max_target_seqs 1 the threshold is 52 candidates, and thus your databases with 52 or 53 sequences gave different answers:

ADD COMMENT
2
Entering edit mode

Thank you! https://www.ncbi.nlm.nih.gov/books/NBK279684/#_appendices_Outline_of_the_BLAST_process_ is a good explanation. It matches with what I could understand from the source codes of BLAST.

The matter I was referring is written in C4 as N_i. And what lieven.sterck referred is written in C3 as X_g. (With my understandings).

ADD REPLY
2
Entering edit mode
6.3 years ago

shamelessly abusing the 'answer' functionality, just to create a side issue here:

I want to point out that this kind of behaviour is not only linked to the -max_target_seq parameter but to all filtering parameters (eg. Evalue).

See this post:

https://bioinformatics.stackexchange.com/questions/3409/blast-hits-disappearing-after-changing-evalue/4146

I once read one on the Evalue specific in which they showed you can go from a prokaryotic to an eukaryotic best hit, but can't seem to find it back (will add if I do)

ADD COMMENT
1
Entering edit mode

Considering how widely BLAST is used and how often people filter on e-values no wonder that we have an irreproducibility problem in life sciences

ADD REPLY
0
Entering edit mode

Istvan Albert : Perhaps I am missing the point but this isn't a make or break type situation, correct? If you are blindly selecting just one hit from a blast search then either your application needs are casual or you are not doing the analysis as carefully as you should.

Based on the tests I did above there is hardly any difference in results. If you are filtering at 1e-39 vs 1e-40 there is not much to say. This is akin to setting a p-value filter in RNAseq data analysis.

Irreproducibility problem based on e-value filtering may have more to do with new sequences being added to the database all the time. We are not using a stable blast DB release like we do for genomes.

ADD REPLY
1
Entering edit mode

My point is that there is a responsibility that toolmakers have to properly document and make a note of situations that might lead to misunderstandings. It is insufficient to point to the documentation (fine print).

Especially in the light that this was noted in 2015, most people that can be considered experts agreed that it was wholly unreasonable and NCBI did nothing about it (or perhaps fixed their web search, hence "pushing the dirt" even further under the rug).

The problem is not as much that the parameter exists or that is called this or that - everyone makes mistakes - it is the arrogance of not doing anything about it for three years is what annoys me the most.

ADD REPLY
0
Entering edit mode

You're absolutely correct on the DB side of this issue genomax and this will certainly add to the whole irreproducibility . But there is still on underlying issue with the blast. The main issue is that people (myself included up to a few months ago) that these kind of thresholds would only affect the 'reporting' but it turns out they are actually influencing the 'algorithm' part of blast causing the observations.

Concerning your test case: the test would be better if you used a protein sequence not (yet) present in the the database (regardless of anything, blast does, in most cases, always finds the identical sequence back). I suggest you select one from the DB, remove it then from the DB, then check the best hit and start moving that one around. But I'm also not really convinced, as stated here somewhere as well, that the issue is 'position in DB related'

ADD REPLY
2
Entering edit mode

these kind of thresholds would only affect the 'reporting' but it turns out they are actually influencing the 'algorithm' part of blast causing the observations.

But has this produced a result that is invalid or toally inaccurate. I don't think BLAST guarantees an optimal solution every time. If you need to ensure that then Smith-Waterman should be used instead.

ADD REPLY
1
Entering edit mode

issue (where also Peter & A: (solved) I couldn't reproduce the problem of max_target_seqs is involved) described here might catalogue as such I think

ADD REPLY
0
Entering edit mode

I agree with you " I don't think BLAST guarantees an optimal solution every time. If you need to ensure that then Smith-Waterman should be used instead.".

Heuristically, I cannot decide which group of sequences is wanted

  • A group of sequences which have high scores in preliminary step but lower scores in the final step
  • A group of sequences which have lower scores in preliminary step but higher scores in the final step

when both groups have very small evalue in the final step.

I'm a greedy person that I want to get both groups. But how about the person who set "-max_target_seqs 1" ...? The latter one? But the former has higher scores at the preliminary step (with the algorithm differs from that of final step) it means that the former is better than the latter in different point of view.

On the other hands, if somebody says "should be better documented", I also agree with it.

https://www.ncbi.nlm.nih.gov/books/NBK279684/#_appendices_Outline_of_the_BLAST_process_ is good document. I’m wondering when this document was made. Last update is May 18, 2016. My bad if it has been two years.

(edited 2018/11/03/23:11JST)

ADD REPLY
1
Entering edit mode

I asked by email and https://www.ncbi.nlm.nih.gov/books/NBK279684/#_appendices_Outline_of_the_BLAST_process_ was added in October 2018 (which I mentioned when I linked to it in https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-repartee-two.html ), but they weren't sure where the last modified date came from and why it had not updated. Note this was added after the Shah et al 2018 letter was published.

ADD REPLY
0
Entering edit mode

Thank you for the information! & Thank you for taking the trouble to ask the support. I have checked google chache and archive.is but couldn't confirm the date.

ADD REPLY
1
Entering edit mode

I checked too, before simply asking - the reason there is a cache on archive.org from 2 November is because I requested the snapshot ;)

ADD REPLY
0
Entering edit mode

I think the e-value matter is caused in the probability distribution estimation step. (Just I think. There are no evidence.)

(update 2018/11/03)

I was 90% wrong. Some estimations (evalue to score) were done but according to the https://www.ncbi.nlm.nih.gov/books/NBK279684/#_appendices_Outline_of_the_BLAST_process_ the filterings were done at the step same as the max_target_seqs.

ADD REPLY
0
Entering edit mode

likely, but common interpretation will also be that this is done at the end (== the 'reporting' stage)

ADD REPLY
0
Entering edit mode

Yes, e-value is also used as a filter before the final step - this was also noted in the comment from Harm Nijveen on 3 Dec 2015 on the original max_target_seqs report from Sujai (I would link to it but BioStars insists on embedding the entire gist)

ADD REPLY
0
Entering edit mode
5.8 years ago
gb ★ 2.2k

I know this tread is old and solved but wanted to add something if people run into the same "problem". If you use -perc_identity 100 it can happen that you will not get a hit at all even that the database contains the same sequence as the query.

In my case I do not have a hit at all with -max_target_seqs 1 -perc_identity 100 or even -max_target_seqs 7 -perc_identity 100. But when I set the -max_target_seqs higher then 8 blastn will find and return the right hit (and only one because there are no other perfect matches).

If I do -max_target_seqs 1 -perc_identity 99 I get the another hit which is not exactly the same as my query.

If I do -max_target_seqs 8 -perc_identity 99 I get 8 hits and none of them is exactly the same as my query. I used -max_target_seqs 8 because that was the treshold in the first testcase.

I need to do -max_target_seqs 16 -perc_identity 99 to get my exact match in the results.

EDIT:

Thanks to the comment of genomax. I used blast-2.6.0 and 2.8.1 and used a custom selection of the nt database.

ADD COMMENT
0
Entering edit mode

This thread is only 5 months old.

You need to add information about version of blast+ and what databases you are using. If these observations are for a custom database then these observations may only apply to your specific case.

ADD REPLY
0
Entering edit mode

Yes they do only apply for me, specially these specific parameters. I wanted to point out that instead of a different result you can also get no hits at all because of the -max_target_seqs parameter. Maybe for some it is obvious but for others they can find the cause to there problem.

If it does not fit here I can remove the post ofcourse.

ADD REPLY
0
Entering edit mode

I made several dummy dbs and ran searches several times but I couldn't reproduce the problem.

There are many things to consider.

  • How long is the length of the query sequence?
  • Does it contain N?
  • How long is the lengths of -perc_identity 99 hits?
  • Are they masked?
  • How are the alignments with them?
  • How is the result with the option "-ungapped"?

etc.

To investigate deeper, we need reproducible examples.

ADD REPLY

Login before adding your answer.

Traffic: 3580 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6