Question

Does BLAST keep only the best HSP score/evalue as sequence score/evalue ?

1

Entering edit mode

2.6 years ago

PRog ▴ 10

When perfoming a tblastn search, I often get mutiple HSP per sequence.

It seems the reported score and evalue for the sequence are those of the best HSP, but I could not be sure it is actually the rule used by blast. In particular two discussions confused me :

https://github.com/davidemms/OrthoFinder/issues/144 -> Where it seems to be indeed the case
How an alignment with multiple HSPs is evaluated in blastn, blastp? -> Where it seems NOT to be the case.

Has anyone more information about this ?

EDIT

So I have been doing some tests and we can at least stand that this is not the best HSP evalue that is used as sequence evalue. Here is an example :

Sequences producing significant alignments:                          (Bits)  Value
Smik4:1077356-1078572                                                 33.9    0.001
...

>Smik4:1077356-1078572
Length=1216

 Score = 31.6 bits (70),  Expect(2) = 4e-04, Method: Compositional matrix adjust.
 Identities = 18/23 (78%), Positives = 18/23 (78%), Gaps = 2/23 (9%)
 Frame = -1

Query  62   SFYMLHF--PITLSILAFQLPLN  82
            SFYM  F  PITLSILAFQLPL 
Sbjct  298  SFYMFLFIDPITLSILAFQLPLR  230


 Score = 23.5 bits (49),  Expect(2) = 4e-04, Method: Compositional matrix adjust.
 Identities = 10/18 (56%), Positives = 13/18 (72%), Gaps = 0/18 (0%)
 Frame = -2

Query  7    LSEVCVNFVIIIGIPLLI  24
            LSE+ V F +I+GIP  I
Sbjct  459  LSEIIVEFDVIVGIPFFI  406


 Score = 33.9 bits (76),  Expect = 0.001, Method: Compositional matrix adjust.
 Identities = 20/43 (47%), Positives = 26/43 (60%), Gaps = 0/43 (0%)
 Frame = -3

Query  51   LHTIILRLFFLSFYMLHFPITLSILAFQLPLNLLTLSQASFHL  93
            LH  +     + + + H+PITLSILAFQLPL  +T S    HL
Sbjct  782  LHNSVYAFIVILYIVSHYPITLSILAFQLPLISMTASHNLCHL  654

The first two HSP have been grouped for sum statistics (see Expect(2)) and so their evalue is the same and it is better than the third HSP evalue (4e-04 < 0.001).

So here it seems that the HSP with the best best bit score was used for the sequence alignment. I have been perfoming mutliple tests and for now I could not find any case where this rule (keeping the HSP with the best bit score) does not work.

After several days of testing on a dozen of model species I cannot find a difference in the number of significant hits between the standard format and the tabular one when using the criterion : "the HSP with the best bit score represents the sequence alignment" and then filtering by a given evalue.

However It seems that in format 0; the ranking of the "Sequences producing significant alignments" is based on the evalue :

        Query= YBL108W_mRNA

        Length=102
                                                                              Score     E
        Sequences producing significant alignments:                          (Bits)  Value

        orderedSpar9:416683-430241                                            60.5    1e-21
    ...
        orderedSpar15:1048403-1049604                                         70.9    1e-16
        orderedSpar3:310646-313260                                            63.9    3e-14
        orderedSpar11:1342-3972                                               53.9    1e-10
        orderedSpar9:408310-411662                                            52.8    2e-10
        orderedSpar4:9673-13501                                               51.6    7e-10
        orderedSpar2:8816-10868                                               51.2    1e-09
        orderedSpar6:295181-298189                                            37.4    8e-05
        scaffold_241:1672-4041                                                38.1    2e-08
    ...

Here the last two aligments do not seem to be ranked neither by score nor by evalue. The last score-evalue pair is actually based on an HSP whose displayed evalue has been corrected with sum statistics :

>scaffold_241:1672-4041
Length=2369

 Score = 38.1 bits (87),  Expect(2) = 2e-08, Method: Compositional matrix adjust.
...
 Score = 31.6 bits (70),  Expect(2) = 2e-08, Method: Compositional matrix adjust.

So I guess there are three steps here : first use the best HSP to represent the sequence, then if its (uncorrected ?) evalue is above the set threshold (here 1) report its score and evalue, and finally rank the sequences by uncorrected evalue.

BLAST • 561 views

ADD COMMENT • link 2.5 years ago by PRog ▴ 10