(run this through a good markdown with mathjax to see the equations)
How the Expected Value is computed
It seems to be a little complicated. The expected value for a match of score $S$ is
$ E = K m n e^{-\lambda S} $. Here, $K$ and $\lambda$ are parameters dependent on the type of BLAST and substitution matrices rather than the content of the database and query. $m$ and $n$ are the effective lengths of the query and database, respectively.
The Statistical Parameters $K$ and $\lambda$
For any BLAST search, if you get the results in XML format (-outfmt 5
), there is a section on statistics, for example:
<Statistics>
<Statistics_db-num>189839</Statistics_db-num>
<Statistics_db-len>131151532</Statistics_db-len>
<Statistics_hsp-len>25</Statistics_hsp-len>
<Statistics_eff-space>47402083875</Statistics_eff-space>
<Statistics_kappa>0.46</Statistics_kappa>
<Statistics_lambda>1.28</Statistics_lambda>
<Statistics_entropy>0.85</Statistics_entropy>
</Statistics>
Here, what I am calling $K$ is Statistics_kappa
, and $\lambda$ is Statistics_lambda
.
Effective search space
Assume a database of $N$ sequences with an actual combined sequence length of $n_a$ being searched with one query sequence of length $m_a$. Now the naive calculation for a search space is $m_a n_a$. However, an alignment starting towards the end of one sequence can not possibly align the whole sequence if either the query or a database sequence runs out. Hence, the effective length of the database is $n = n_a - N \bar{l}$, and the effective query length is $m = m_a - \bar{l}$, where $\bar{l}$ is the length correction. The length correction is computed based on the lengths of query and database and given as Statistics_hsp-len
. The effective search space $m n = (m_a - \bar{l})(n_q - N \bar{l})$ and is shown as Statistics_eff-space
. Also, $N$ is Statistics_db-num
and $n_a$ is Statistics_db-len
. You can run some searches and verify the calculation of the effective search space, and the expected values based on the raw scores.
Problem with translated search
Now the analysis shown above is correct, I think for blastp and blastn. However, I am not able to figure out how the effective search space and exp. value is calculated for blastx and tblastx etc. Anyone who has any idea or can ask NCBI, please let us know here. Once we know how to calculate the search space sizes, perhaps we can compare the exp. values.
what does
--dbsize
really do in this context. Does it actually limit the length of the database sequence to a certain size?It doesn't limit or modify your database in anyway. It will just use the value given when calculating the e-value instead of the actual database size.
but in that case wouldn't that give you an E-value that does not correspond to reality? And if so it does not matter wether they are comparable or not since each is inaccurate in a different way.
I think I'll just use the bit score.
I agree that the e-value given with a specified -dbsize doesn't correspond to reality of the database size.
Let me give an example case. Let's say I want to blast my nucleotide query to a set of 100 protein sequences and a set of 50 nucleotide sequences. I want to use blastx for the 100 protein sequences and tblastx for the 50 nucleotide sequences. My database is really 150 sequences, but I have to split them into 2 due to protein/nucleotide content. So if I set -dbsize for both blastx and tblastx as 150, I should get comparable e-values? What if I did blastx and blastn?
I don't think '-dbsize' is the right choice for this. This feature probably only exists to test the limits on E-values in hypothetical best/worst case scenarios, but the values it produces are not meaningful otherwise.
What would be the way to compare the blastx and tblastx in this case? Surely you just can't compare the bit scores?
I don't know the answer yet, but for one, the
-dbsize
parameter in the blast command line is supposed to be the cumulative length of all the sequences in the database, rather than the number of sequences in the database.