Hi,
Here is an interesting paper that I wanted to share with Biostars:
Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows
I will cite this with the fair use policy:
To enable the efficient processing of large data sets, researchers frequently rely on shortcuts aimed at reducing the number of BLAST results that need to be processed. A common strategy involves using the "-max_target_seqs" parameter of the NCBI BLAST+ suite. According to the BLAST documentation itself (2008-), this parameter represents the "number of aligned sequences to keep". This statement is commonly interpreted as meaning that BLAST will return the top N database hits for a sequence query if the value of max_target_seqs is set to N. For example, in a recent article (Wang, et al., 2016) the authors explicitly state "Setting “max target seqs” as “1,” only the best match result was considered."
To our surprise, we have recently discovered that this intuition is incorrect. Instead, BLAST returns the first N hits that exceed the specified E-value threshold, which may or may not be the highest scoring N hits. The invocation using the parameter "-max_target_seqs 1" simply returns the first good hit found in the database, not the best hit as one would assume.
Worse yet, the output produced depends on the order in which the sequences occur in the database. For the same query, different results will be returned by BLAST when using different versions of the database even if all versions contain the same best hit for this database sequence.
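To make the difference concrete, here is a toy sketch of the behaviour as the quoted passage describes it. It is not BLAST's actual implementation: the hit tuples, the function names and the cutoff value are all made up for illustration. It contrasts "first N hits that pass the E-value cutoff" with "best N hits by score" on the same hits presented in two different database orders.

```python
from typing import List, Tuple

# Hypothetical hit record: (subject_id, evalue, bitscore)
Hit = Tuple[str, float, float]


def first_n_passing(hits: List[Hit], n: int, evalue_cutoff: float) -> List[Hit]:
    """Keep the first n hits, in database order, whose E-value passes the cutoff."""
    kept: List[Hit] = []
    for hit in hits:
        if hit[1] <= evalue_cutoff:
            kept.append(hit)
            if len(kept) == n:
                break
    return kept


def best_n_passing(hits: List[Hit], n: int, evalue_cutoff: float) -> List[Hit]:
    """Keep the n highest-scoring hits among those that pass the cutoff."""
    passing = [h for h in hits if h[1] <= evalue_cutoff]
    return sorted(passing, key=lambda h: h[2], reverse=True)[:n]


# The same three made-up hits, stored in two different database orders.
db_order_a: List[Hit] = [
    ("subj1", 1e-20, 95.0),
    ("subj2", 1e-50, 210.0),
    ("subj3", 1e-5, 40.0),
]
db_order_b: List[Hit] = [db_order_a[1], db_order_a[0], db_order_a[2]]

first_a = first_n_passing(db_order_a, 1, 1e-10)
first_b = first_n_passing(db_order_b, 1, 1e-10)
best_a = best_n_passing(db_order_a, 1, 1e-10)
best_b = best_n_passing(db_order_b, 1, 1e-10)

print(first_a[0][0], first_b[0][0])  # order-dependent: subj1 subj2
print(best_a[0][0], best_b[0][0])    # order-independent: subj2 subj2
```

In this toy model the "first N" rule gives a different answer for each database order, while the "best N" rule does not, which is the order dependence the quoted passage complains about.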
That is fair to some extent. In my opinion the source of the confusion is that it says "max" there; it should instead be called "limit", or even better "first", to make it unambiguous.
Usually, there are many things to juggle and consider in a typical analysis, and it is very easy to slip up and take the "max" as maximal in a different context: the score or some other attribute.
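One way to avoid leaning on the name of the parameter is to pick the best hit yourself from the tabular output. The sketch below is only an illustration and makes some assumptions: the search was run with a generous hit limit, the output is the default 12-column tabular format (`-outfmt 6`, with the bit score in the last column), and the function name `best_hit_per_query` and the example rows are made up. It does not guarantee that the true best hit was reported in the first place.

```python
import csv
import io
from typing import Dict, List


def best_hit_per_query(tabular_text: str) -> Dict[str, List[str]]:
    """Return {query_id: row}, keeping the row with the highest bit score.

    Assumes the default 12 tab-separated columns of -outfmt 6:
    qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore
    """
    best: Dict[str, List[str]] = {}
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query_id, bitscore = row[0], float(row[11])
        if query_id not in best or bitscore > float(best[query_id][11]):
            best[query_id] = row
    return best


# Made-up rows for two queries; q1 has two hits, the better one listed second.
example = (
    "q1\tsubjA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-20\t95.0\n"
    "q1\tsubjB\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-50\t210\n"
    "q2\tsubjC\t85.0\t80\t12\t0\t1\t80\t5\t84\t1e-12\t70.2\n"
)

best = best_hit_per_query(example)
print({query: row[1] for query, row in best.items()})
```

Ranking by bit score is a choice made for this sketch; ranking by E-value, or breaking ties on percent identity, would be just as easy to swap in.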
I just noticed it in the book titled A Primer for Computational Biology, where it explicitly states "best":
https://www.amazon.com/Primer-Computational-Biology-Shawn-ONeil/dp/0870719262
It's only unfortunate that the paper omitted to state that this affects all the filtering parameters in the BLAST algorithm, such as Evalue and num_alignments.
Otherwise good to finally see this issue described in a manuscript context instead of blog posts.
(solved) I couldn't reproduce the problem of max_target_seqs
I could reproduce the problem, and my conclusion is that it is not caused by what the paper describes, but by what NCBI staff explained in 2015.
According to "Reply to the paper: Misunderstood parameters of NCBI BLAST impacts the correctness of bioinformatics workflows", there was a bug.
Reply to the paper: Misunderstood parameters of NCBI BLAST impacts the correctness of bioinformatics workflows
I wrote a related entry (and moved it to "I couldn't reproduce the problem of max_target_seqs"), following the suggestion made by genomax.
You should create a new post if you think this is a reproducible problem. Posting it in this thread is not the best place for this.
Uh, really? I never thought of that. Thank you for the suggestion. Ok, I'll create a new post. But I do not think the problem (in the paper) is reproducible and I thought I had written as such.
What I meant was that not being able to reproduce the results in the paper may be a reproducible problem, if others get the same results as you did.
I see, thank you. (I'm googling how to close my answer...) Edited: OK, I think I did it fine.
Certainly not a recent 'problem', as this issue was raised many years ago:
blast-max-target-sequences-bug
Which is actually properly cited in the Shah et al. paper: "This functionality was first reported as a bug to NCBI by Kumar (2015), and later documented in a blog post (Cock, 2015) by Peter Cock."