Question

(BLAST) Why different e-values if I query a sequence alone, or together with other sequences?

1

4.6 years ago

johan ▴ 120

Hi, I need to create a table showing how E-values are distributed for some sequences, as a way of reporting how conserved the sequences are.

I got some inconsistent results and narrowed the cause down to whether I query a sequence alone or together with other sequences. The results page has a drop-down menu where you can pick only a single query sequence at a time, so I assumed that each query's results are independent of the other query sequences. Is that right?

Here is an example to show it. In the first test, I query ">1" alone, and the top hits are 4e-9. In the second test, I query ">1" together with ">0" and ">2", and when I look at only ">2", the top hits are "3e-9"

[Screenshots: first test set-up and results; second test set-up and results]

I just ran the same test with other sequences, and I got either 0.35 or 0.1 for the top hits. All settings are identical between the two searches: I go to nucleotide BLAST, enter my queries, enter an organism, switch the program to "blastn", and set the number of hits to 20,000. All other settings are the defaults.

So what is the correct way of doing this search? I'm so confused at the moment :<

BLAST e-value • 2.5k views


0

4.6 years ago

Istvan Albert 102k

This sounds like the problem caused by max_target_seqs, described in "Misunderstood parameter of NCBI BLAST".

Here is a direct link to the publication

https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty833/5106166?redirectedFrom=fulltext


0

Thanks for the comment. I wasn't aware of this problem. However, I really don't think this is the issue here.

Here is another search:

In the first search, my query sequence is:

>my_sequence
gacttatcAAAactggcaGGGGGccactgCCCacaggattagcaCCCCCgaggtatgtaATATATATctacagagttcttga

In the second search, my query sequences are:

>another_sequence
tttccccctggaagctcccAcgtgcgctcGAGAGAGcgaccctgccgcttaccggatacctgtccgcctttctccctt
cgggaagcgtggcgctttctcatagctcacgctgtaggtGGGtcagttcggtgtaggtcgTATATATATcaagctgggctg
>my_sequence
gacttatcAAAactggcaGGGGGccactgCCCacaggattagcaCCCCCgaggtatgtaATATATATctacagagttcttga
>yet_another_sequence
agtggtggcctaactacggctGGGtagaagaacagtatttggtatctgcgctctgGGGgaagccagttaccttcggaaa
aagagttggtagctcttgatccTTTaaacaaaccaccgctggtagcggtggtttttATATATATagcagcagattacg

I don't change any parameters in these searches. The max_target_seqs limit ("Max target sequences" on the web page) should not apply here since I only get 4 hits for >my_sequence.

The results from the first search:

E-value Ident.  Accession
2e-06   78.05%  DQ977720.1
8e-04   75.61%  DQ977719.1
8e-04   75.61%  DQ977718.1
8e-04   75.61%  AB084167.1

The results from the second search:

E-value Ident.  Accession
4e-06   78.05%  DQ977720.1
0.002   75.61%  DQ977719.1
0.002   75.61%  DQ977718.1
0.002   75.61%  AB084167.1

As you can see, the E-values are completely different :<

[Screenshots: setup and results for the first search; setup and results for the second search]
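One way to quantify the discrepancy is to divide each E-value from the second search by the matching E-value from the first. The sketch below simply hard-codes the four hits from the two result tables above; the variable names are illustrative only.

```python
# E-values copied from the two result tables above, keyed by accession.
alone = {
    "DQ977720.1": 2e-06,
    "DQ977719.1": 8e-04,
    "DQ977718.1": 8e-04,
    "AB084167.1": 8e-04,
}
together = {
    "DQ977720.1": 4e-06,
    "DQ977719.1": 0.002,
    "DQ977718.1": 0.002,
    "AB084167.1": 0.002,
}

# Fold change of each hit's E-value when the query is submitted with others.
fold_change = {acc: together[acc] / alone[acc] for acc in alone}

for acc, ratio in fold_change.items():
    print(f"{acc}\t{ratio:.1f}x")
```

Every hit shifts by a similar factor (about 2 to 2.5, bearing in mind that the reported E-values are rounded to one significant digit) while the percent identities are unchanged, which suggests the alignments are the same and only the statistics differ.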

4.6 years ago by johan ▴ 120

2

If I run this search locally using BLAST+ v2.10 against the nt database, I see no difference in the results for >my_sequence whether it is submitted alone or together with other sequences.

NCBI does things differently in the web interface, and this may simply be a result of that. Submit a ticket to the NCBI help desk if you want to understand why this is happening.

4.6 years ago by GenoMax 147k

0

Thanks. I've sent a ticket to NCBI help desk.

4.6 years ago by johan ▴ 120

0

You are correct. The problem I described should not modify the E-values themselves; it only changes which hit is reported as the best hit. For the same hit, the E-value should be the same.

I ran a few tests and can reproduce the same inconsistency with just two sequences. It can be triggered merely by changing which of the two sequences is listed first, and all E-values are affected for both sequences. This is most unexpected and perhaps incorrect behavior.

As GenoMax suggests, I would send an email to the NCBI help desk. Make two files, one with each sequence listed first, to help the process move faster. If you do, perhaps you could also let us know what they say.

4.6 years ago by Istvan Albert 102k

0

Thanks. I've sent a ticket to NCBI help desk.

4.6 years ago by johan ▴ 120

score 3 · Accepted Answer · 2020-05-11

Update: the NCBI help desk responded quickly, was able to replicate the bug, and has already fixed it.

For others who may have performed similar BLAST searches, I paste their response below.

The developers have addressed this issue.

In summary:

• The problem only occurred for BLASTN/megaBLAST searches.

• It only happened if multiple queries were submitted at once. Results for the first query would be correct, but all other searches would use the search space for the first query instead of for each individual query.

• It only affected the web page. Stand-alone BLAST+ does not have this issue.

• Also, are you aware of this paper? https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6662297/ It does have implications for E-values in some situations.
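To see why a wrong search space would shift every E-value for a query, recall that BLAST statistics relate the E-value to the bit score as E = m' * n' * 2^(-S'), where S' is the bit score and m' and n' are the effective query and database lengths. Below is a minimal sketch of that relationship. The bit score and all lengths are made-up values chosen for illustration, not numbers taken from the searches in this thread.

```python
def evalue(bit_score: float, eff_query_len: int, eff_db_len: int) -> float:
    """Expected number of chance hits: E = m' * n' * 2**(-S')."""
    search_space = eff_query_len * eff_db_len
    return search_space * 2.0 ** (-bit_score)

# All numbers below are hypothetical, chosen only for illustration.
BIT_SCORE = 60.0
EFF_DB_LEN = 1_000_000_000
OWN_EFF_QUERY_LEN = 60      # effective length of the query being scored
FIRST_EFF_QUERY_LEN = 130   # effective length of the first query in the batch

# Same alignment, same bit score; only the search space differs.
correct = evalue(BIT_SCORE, OWN_EFF_QUERY_LEN, EFF_DB_LEN)
buggy = evalue(BIT_SCORE, FIRST_EFF_QUERY_LEN, EFF_DB_LEN)
inflation = buggy / correct

print(f"E with own search space:           {correct:.1e}")
print(f"E with first query's search space: {buggy:.1e}")
print(f"inflation factor:                  {inflation:.2f}")
```

In this model the factor depends only on the ratio of the two search spaces, so it is the same for every hit of a given query. A query submitted after a longer first query would see all its E-values rise, and one submitted after a shorter first query would see them fall.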