When perfoming a tblastn search, I often get mutiple HSP per sequence.
It seems the reported score and evalue for the sequence are those of the best HSP, but I could not be sure it is actually the rule used by blast. In particular two discussions confused me :
- https://github.com/davidemms/OrthoFinder/issues/144 -> Where it seems to be indeed the case
- How an alignment with multiple HSPs is evaluated in blastn, blastp? -> Where it seems NOT to be the case.
Has anyone more information about this ?
EDIT
So I have been doing some tests and we can at least stand that this is not the best HSP evalue that is used as sequence evalue. Here is an example :
Sequences producing significant alignments: (Bits) Value
Smik4:1077356-1078572 33.9 0.001
...
>Smik4:1077356-1078572
Length=1216
Score = 31.6 bits (70), Expect(2) = 4e-04, Method: Compositional matrix adjust.
Identities = 18/23 (78%), Positives = 18/23 (78%), Gaps = 2/23 (9%)
Frame = -1
Query 62 SFYMLHF--PITLSILAFQLPLN 82
SFYM F PITLSILAFQLPL
Sbjct 298 SFYMFLFIDPITLSILAFQLPLR 230
Score = 23.5 bits (49), Expect(2) = 4e-04, Method: Compositional matrix adjust.
Identities = 10/18 (56%), Positives = 13/18 (72%), Gaps = 0/18 (0%)
Frame = -2
Query 7 LSEVCVNFVIIIGIPLLI 24
LSE+ V F +I+GIP I
Sbjct 459 LSEIIVEFDVIVGIPFFI 406
Score = 33.9 bits (76), Expect = 0.001, Method: Compositional matrix adjust.
Identities = 20/43 (47%), Positives = 26/43 (60%), Gaps = 0/43 (0%)
Frame = -3
Query 51 LHTIILRLFFLSFYMLHFPITLSILAFQLPLNLLTLSQASFHL 93
LH + + + + H+PITLSILAFQLPL +T S HL
Sbjct 782 LHNSVYAFIVILYIVSHYPITLSILAFQLPLISMTASHNLCHL 654
The first two HSP have been grouped for sum statistics (see Expect(2)) and so their evalue is the same and it is better than the third HSP evalue (4e-04 < 0.001).
So here it seems that the HSP with the best best bit score was used for the sequence alignment. I have been perfoming mutliple tests and for now I could not find any case where this rule (keeping the HSP with the best bit score) does not work.
After several days of testing on a dozen of model species I cannot find a difference in the number of significant hits between the standard format and the tabular one when using the criterion : "the HSP with the best bit score represents the sequence alignment" and then filtering by a given evalue.
However It seems that in format 0; the ranking of the "Sequences producing significant alignments" is based on the evalue :
Query= YBL108W_mRNA
Length=102
Score E
Sequences producing significant alignments: (Bits) Value
orderedSpar9:416683-430241 60.5 1e-21
...
orderedSpar15:1048403-1049604 70.9 1e-16
orderedSpar3:310646-313260 63.9 3e-14
orderedSpar11:1342-3972 53.9 1e-10
orderedSpar9:408310-411662 52.8 2e-10
orderedSpar4:9673-13501 51.6 7e-10
orderedSpar2:8816-10868 51.2 1e-09
orderedSpar6:295181-298189 37.4 8e-05
scaffold_241:1672-4041 38.1 2e-08
...
Here the last two aligments do not seem to be ranked neither by score nor by evalue. The last score-evalue pair is actually based on an HSP whose displayed evalue has been corrected with sum statistics :
>scaffold_241:1672-4041
Length=2369
Score = 38.1 bits (87), Expect(2) = 2e-08, Method: Compositional matrix adjust.
...
Score = 31.6 bits (70), Expect(2) = 2e-08, Method: Compositional matrix adjust.
So I guess there are three steps here : first use the best HSP to represent the sequence, then if its (uncorrected ?) evalue is above the set threshold (here 1) report its score and evalue, and finally rank the sequences by uncorrected evalue.