Obtaining A Maximum Number Of Blast Hits: Problem...
5
3
Entering edit mode
12.3 years ago
Nicojo ★ 1.1k

I am having trouble getting blast to give me "correct" results.

I am trying to retrieve as many hits with e-value better than 1. I query the database with a sequence that should have several thousand hits in the database. However, at best, using tblastn, I get more or less 1000 matches (~250 independent hits). I am at a loss understanding what is wrong with my command:

tblastn -query protein.fasta -db nucl.blastdb -out results.tblastnout -evalue 1 -outfmt 7 -num_descriptions 100000

I get several matches within one same sequence hit, all with e-values better than 1. But what confuses me is that the best e-value of the worst hit (sorry if this is confusing ;) is nowhere near the -evalue limit, and is usually lower than 1E-60... Obviously, even including the redundancy of matches within a hit, I certainly do not reach the 100000 limit I asked for.

So I have three questions:

  1. Is it possible to only list one (the best) "match" per "hit"?
  2. Any idea why I do not get a larger number of descriptions, considering that I expect to have close to 30000 positives in my database?
  3. Any comments/suggestions to improve my search?

EDIT:

After investigation and testing (a bit)...

For question 1)

In order to limit the number of matches (HSPs) in a sequence hit, I tried the "Best-Hits filtering algorithm". As a result, this brought my newly found 50000 hits (with -max_target_seqs) down to 253. In reality however, it increased the number of positive hits by 2. So I'm guessing that the number of "hits" listed in the header of the output is only the number of HSPs. In any case, this seems to help with the multiple hit per target sequence. But is it the full answer? Or are there other ways?

For question 2)

With the argument -num_descriptions, changing the value doesn't really change the result. It turns out that with tabular formatting, one should use the argument -max_target_seqs instead of the previous one (see latest BLAST user manual). As a result, I get an interestingly different output: the number of "hits" announced in the output header dramatically increases, in this case to >50000. This is the result that I'd expect... However, the actual list of results is exactly the same as with -num_descriptions :(

blast • 27k views
ADD COMMENT
2
Entering edit mode
12.3 years ago
Nicojo ★ 1.1k

NOTE: This question and answer concerns BLAST 2.2.25+... I do not know if any of this is valid for other versions of BLAST.

Answer to question #2:

My initial (main) problem was that the following blast+ command...

tblastn -query protein.fasta -db nucl.blastdb -out results.tblastnout -evalue 1 -outfmt 7 -num\_descriptions 100000

...did not give me the result that I was expecting.

Normally, in BLAST (whether the old or the new), you can say how many descriptions and alignments you want (depending on the version):

  • -v or -num_descriptions determine the number of descriptions that are output. By default these are 500.
  • -b or -num_alignments determine the number of alignments that are output. By default these are 250.

Indeed, with these default settings, and provided there are at least 500 hits, a blast will produce 500 descriptions and 250 alignments. This should imply that each argument is independent of the other.

When it comes to the tabular output formats (flag -outfmt 6 and above) the output is devoid of alignments. In BLAST+, the documentation even suggests that one should use the flag -max_target_seqs instead of -num_descriptions. This should imply that the flag -num_alignments has no impact on tabular format outputs.

Oddly enough, it seems that neither of these assumptions are true, when requesting tabular output formatting: although the num_alignment flag can be ignored for pairwise output, it will limit the num_descriptions output.

In order to retrieve the maximum number of hits in a tabular format, with BLAST+, one should use a command similar to the following:

tblastn -query protein.fasta -db nucl.blastdb -out results.tblastnout -evalue 1 -outfmt 7 -num\_descriptions 100000 -num\_alignments 100000

Why this would be the case beats me. I will ask the NCBI BLAST team and report back if I get an answer...

Update:

I have contacted the BLAST Help team, and received a reply. However, they only pointed to the use of -maxtargetseqs when using outfmt flags >4. I emailed back to point out that in version 2.2.25+, that this does not return the expected result. I'm still waiting to hear back from them...

So, for anybody having issues, I highly recommend to use the latest version of BLAST: 2.2.27+. I have verified: the use of the option "maxtargetseqs" works as expected!

Update 2:

No news from the BLAST Help team... I guess they don't like it when people point out bugs :(

Conclusion: USE THE LATEST VERSION ;)

ADD COMMENT
0
Entering edit mode

I am having the same problem with tblastx (version 2.2.28+). I am comparing two bacterial genomes, and when I run tblastx, I only get 400 hits, independently of max_target_seqs and num_alignments values...

I believe that there is a bug, because when I compare a genome with itself, I only get 400 hits...

ADD REPLY
1
Entering edit mode
12.3 years ago
Ketil 4.1k

There is a limit on the number of alignments that will be produced. From the (old) BLAST documentation:

-v  Number of database sequences to show one-line descriptions for (V) [Integer]
    default = 500
-b  Number of database sequence to show alignments for (B) [Integer]
    default = 250

I'm not familiar with BLAST+ (which you seem to be using), but I suppose it has the same limits, and similar options to modify them?

ADD COMMENT
0
Entering edit mode

@Ketil, thanks for the answer! However, I'm not quite sure that the "default" means "max number limit"... In any case, changing the number to e.g. 5000 does not improve the result. Since I already reach >1000, I don't think there is such a limit. Any other idea?

ADD REPLY
0
Entering edit mode

Which of these numbers did you try to change? Looking at some of my files (i.e. looking at Hit_num in BLAST XML output), there is a cap on hits at 500, I imagine increasing -v would lift that.

Anyway, if that doesn't work, I'm out of ideas.

ADD REPLY
0
Entering edit mode

@Ketil: my command line calls for "-outfmt 7", which in BLAST+ is a tabular format. Therefore no alignments are output. The only arguments that influence the number of hits are "-num_descriptions" and, as I just found out, "-max_target_seqs". Initially I changed the first. Now I've tested both: check out the edit to the question, as it gets even stranger...

ADD REPLY
0
Entering edit mode

thanks for following up - many of these tools make use of a number of tacit assumptions when they are operated outside of the most common parameter range - I'm learning a lot from it!

ADD REPLY
0
Entering edit mode

Indeed, the documentation is good, but not extensive... I'm still hesitating between obscureness of the tool options or bad assumptions on my part. I am in the process of testing my dataset to be sure that it contains what I believe it does. Will update as I go!

ADD REPLY
0
Entering edit mode

can i limit the number of hits to 10 best ones

the command line i use is

./blastn -query /home/Downloads/esembl/EDL36652.1d.fasta -db /home/Desktop/ncbi-blast-2.2.30+/output2/mart -outfmt 7 -max_target_seqs 1 -out /home/CCMB/Desktop/ncbi-blast-2.2.30+/outformat/result

but i end up with 31 hits

ADD REPLY
0
Entering edit mode
12.3 years ago
AGS ▴ 250

@Nicojo,

"Is it possible to only list one (the best) "match" per "hit"?"

If I remember correctly, using the -K 1 flag will limit output to the best hit.

ADD COMMENT
0
Entering edit mode

please add this as as comment not as a new answer

ADD REPLY
0
Entering edit mode

This is an answer to question #1 that the OP asked.

ADD REPLY
0
Entering edit mode

ok I see, my bad, since you had the @Nicojo there I thought that you are referring to his comment above.

ADD REPLY
0
Entering edit mode

@AGS: thanks for your answer. Unfortunately, that flag seems to be from the old NCBI BLAST, and not the current NCBI BLAST+. I have just found a recent version of the BLAST Command Line Applications Manual, with the description of all arguments. Unfortunately, there is no "K" argument. The only thing I could imagine were the "-best_hit_overhang" and "-best_hit_score_edge". See question edit for details.

ADD REPLY
0
Entering edit mode
12.3 years ago
Rm 8.3k

To control number of top hits from blast+ : For descriptions: [-num_descriptions int_value] ; for alignments: [-num_alignments int_value]

which are equivalent to -v and -b options from older blast

ADD COMMENT
0
Entering edit mode

@Rm: thanks. Unfortunately those do not do the job, which is why I'm asking the question. Please read the full post for the details...

ADD REPLY
1
Entering edit mode

My two cents: Output all possible hits in default (pair wise) format using very large values for the [-num_descriptions int_value] and [-num_alignments int_value] and see how many hits you will actually get; then you can play with the -outfmt 7 options , or you can write a parser script to get desired output from default output format.

ADD REPLY
0
Entering edit mode

@Rm: thanks again... I tried the default output format (only with -num_descriptions, not -num_alignments), and that gave me the result I was expecting. But that doesn't explain why I do not get that result with the tabular output format. In any case, your suggestion put my on the right track. I'll post a complete answer for more details. So, thanks!

ADD REPLY
0
Entering edit mode
7.9 years ago
CB ▴ 20

I had the same issue using blast 2.5.0 I used the option max_target_seqs 10, and I had many cases with more than 10 hits. For some genes I had 10 hits, but for many I had >50 hits. I used the tabular output (-outfmt '7 qseqid sseqid pident evalue stitle qcovs length sstart send' )

ADD COMMENT

Login before adding your answer.

Traffic: 2006 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6