News: Misunderstood parameter of NCBI BLAST
6.2 years ago
Farbod ★ 3.4k

Hi,

Here is an interesting paper that I wanted to share with Biostars:

Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows

blast alignment • 8.4k views

I wrote a related entry (and, following genomax's suggestion, moved it to): I couldn't reproduce the problem of max_target_seqs.


You should create a new post if you think this is a reproducible problem. This thread is not the best place for it.


Uh, really? I never thought of that. Thank you for the suggestion. OK, I'll create a new post. But I do not think the problem (in the paper) is reproducible, and I thought I had written as much.


What I meant was that not being able to reproduce the results in the paper may be a reproducible problem, if others get the same results as you did.


I see, thank you. (I'm googling how to close my answer...) Edited: OK, I think I did it right.


Certainly not a recent 'problem', as this issue was raised many years ago:

blast-max-target-sequences-bug


Which is actually properly cited in the Shah et al. paper: "This functionality was first reported as a bug to NCBI by Kumar (2015), and later documented in a blog post (Cock, 2015) by Peter Cock."

6.2 years ago

I will quote this under fair use:

To enable the efficient processing of large data sets, researchers frequently rely on shortcuts aimed at reducing the number of BLAST results that need to be processed. A common strategy involves using the "-max_target_seqs" parameter of the NCBI BLAST+ suite. According to the BLAST documentation itself (2008-), this parameter represents the "number of aligned sequences to keep". This statement is commonly interpreted as meaning that BLAST will return the top N database hits for a sequence query if the value of max_target_seqs is set to N. For example, in a recent article (Wang, et al., 2016) the authors explicitly state "Setting “max target seqs” as “1,” only the best match result was considered."

To our surprise, we have recently discovered that this intuition is incorrect. Instead, BLAST returns the first N hits that exceed the specified E-value threshold, which may or may not be the highest scoring N hits. The invocation using the parameter "-max_target_seqs 1" simply returns the first good hit found in the database, not the best hit as one would assume.

Worse yet, the output produced depends on the order in which the sequences occur in the database. For the same query, different results will be returned by BLAST when using different versions of the database even if all versions contain the same best hit for this database sequence.
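
As a toy illustration of the claim quoted above (not BLAST's actual search code), the difference between "first N passing hits" and "best N hits" can be sketched in Python. The hit tuples, subject names, scores, and both function names are invented for this example.

```python
from typing import List, Tuple

# Hypothetical hit record: (subject id, E-value, bit score). All values are invented.
Hit = Tuple[str, float, float]


def first_n_hits(hits_in_db_order: List[Hit], max_evalue: float, n: int) -> List[Hit]:
    """Keep the first n hits, in database order, whose E-value passes the threshold."""
    passing = [h for h in hits_in_db_order if h[1] <= max_evalue]
    return passing[:n]


def best_n_hits(hits_in_db_order: List[Hit], max_evalue: float, n: int) -> List[Hit]:
    """Keep the n passing hits with the highest bit score."""
    passing = [h for h in hits_in_db_order if h[1] <= max_evalue]
    return sorted(passing, key=lambda h: h[2], reverse=True)[:n]


# The same three subject sequences, stored in two different database orders.
db_v1 = [("seqA", 1e-20, 80.0), ("seqB", 1e-50, 200.0), ("seqC", 5.0, 20.0)]
db_v2 = [("seqB", 1e-50, 200.0), ("seqA", 1e-20, 80.0), ("seqC", 5.0, 20.0)]

for name, db in (("v1", db_v1), ("v2", db_v2)):
    print(name, "first:", first_n_hits(db, 1e-5, 1), "best:", best_n_hits(db, 1e-5, 1))
```

Under this simplified model, the "first" strategy picks seqA for one database order and seqB for the other, while the "best" strategy picks seqB in both. Whether this model matches what BLAST does internally is disputed further down the thread.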


To be fair the option does not say -max_**best**_target_seqs or -max_**high_scoring**_target_seqs.


That is fair to some extent. In my opinion, the source of the confusion is that it says max; it should instead be called limit or, even better, first, to make it unambiguous.

Usually, there are many things to juggle and consider in a typical analysis - it is very easy to slip up and take the max as maximal in a different context: the score or some other attribute.


Also from the same paper:

The confusion is further compounded by the fact that in the online BLAST portal, the max_target_seqs parameter behaves in the expected way – the best (rather than first) N hits are returned


NCBI does things with the web version that are not available in the command line package. That does lead to some confusion since it is easy to assume/think that those two are equivalent.


Yeah, but the option is basically useless as it is. Like, in what kind of use case would the output make sense? Why not have an option -max_random_hits while at it..

6.2 years ago

Just noticed it in the book titled A Primer for Computational Biology, where it explicitly states "best":

https://www.amazon.com/Primer-Computational-Biology-Shawn-ONeil/dp/0870719262



Is that in reference to the web portal?


I did not check at the time, but note that it talks about output formats 6, 7, or 10. That sounds like command-line use to me; I doubt that is a parameter one would set online.
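
For command-line runs with tabular output, one way to avoid relying on max_target_seqs for ranking is to request more hits and pick the top one per query yourself. Here is a minimal sketch, assuming the default 12-column layout of -outfmt 6 (E-value in column 11, bit score in column 12); the function name and the sample rows are invented for illustration.

```python
import csv
from typing import Dict, Iterable


def best_hit_per_query(lines: Iterable[str]) -> Dict[str, dict]:
    """Return the highest-bit-score hit for each query from tabular BLAST output.

    Assumes the default -outfmt 6 column order:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    Lines starting with '#' (as written by -outfmt 7) are skipped.
    """
    best: Dict[str, dict] = {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        qseqid, sseqid = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        current = best.get(qseqid)
        # Keep the first-seen hit on ties; replace only on a strictly higher score.
        if current is None or bitscore > current["bitscore"]:
            best[qseqid] = {"sseqid": sseqid, "evalue": evalue, "bitscore": bitscore}
    return best


# Invented sample rows: two hits for q1 (lower-scoring one listed first), one for q2.
SAMPLE = (
    "# comment line, as in -outfmt 7\n"
    "q1\tsA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-20\t80.0\n"
    "q1\tsB\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-50\t200.0\n"
    "q2\tsC\t95.0\t50\t2\t0\t1\t50\t1\t50\t1e-10\t60.0\n"
)

result = best_hit_per_query(SAMPLE.splitlines())
print(result)
```

This only helps if the best hit actually made it into the output, so it assumes the search was run with a generous hit limit.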


Author here! I had no idea about this behavior until recently. I'm glad to have learned of it though - I've updated the online version of the book with some errata linking to the paper for details.

6.2 years ago

It's only unfortunate that the paper omitted to state that this affects all the filtering parameters in the BLAST algorithm, such as Evalue and num_alignments.

Otherwise good to finally see this issue described in a manuscript context instead of blog posts.

6.2 years ago
fishgolden ▴ 520

(solved) I couldn't reproduce the problem of max_target_seqs

I could reproduce the problem, and my conclusion is that it is not caused by the behavior described in the paper, but by what NCBI staff explained in 2015.
