Hi guys!
Imagine we are searching over a distributed BLAST database, i.e. slices of one big non-redundant database. We run a separate BLAST task over each slice (piece) with the same query and the same search parameters; everything is identical for each task except the database name. For each task and database slice we then obtain results: a list of hits with alignment, E-value and bit-score (score). We need to display the combined results, so we join all hits into one list and sort it by bit-score (score). The question is: HOW DO WE CALCULATE A SUMMARY E-value for every hit?
The formula for calculating the E-value is:
E-value = Eff-space / 2^bit-score
where Eff-space is the effective search space. The bit-score is independent of database size and stays the same for a given hit, no matter whether we are searching over the whole database or just a small piece of it.
I guess that the summary E-value can be calculated like this:
E-value summary = (Eff-space piece 1 + Eff-space piece 2 + ... + Eff-space piece N) / 2^bit-score
where N is the number of slices (pieces).
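To make the idea concrete, here is a minimal Python sketch of the merge step. It simply takes the formula above as given and assumes the effective search space of each slice is already known (for example, read from the search statistics of each run). The function names and the (hit_id, bit_score) tuple layout are made up for illustration and are not part of BLAST.

```python
def summary_evalue(slice_eff_spaces, bit_score):
    """Summary E-value under the assumed formula:
    (sum of per-slice effective search spaces) / 2^bit_score."""
    return sum(slice_eff_spaces) / 2.0 ** bit_score


def merge_hits(hits_per_slice, slice_eff_spaces):
    """Join the hits of all slices and attach a summary E-value.

    hits_per_slice: one list per slice of (hit_id, bit_score) tuples
                    (hypothetical layout, chosen for this sketch).
    slice_eff_spaces: effective search space of each slice.
    Returns (hit_id, bit_score, summary_evalue) tuples,
    sorted by bit-score, best hit first.
    """
    merged = []
    for hits in hits_per_slice:
        for hit_id, bit_score in hits:
            e = summary_evalue(slice_eff_spaces, bit_score)
            merged.append((hit_id, bit_score, e))
    # Higher bit-score means a better hit, so sort in descending order.
    merged.sort(key=lambda hit: hit[1], reverse=True)
    return merged


if __name__ == "__main__":
    # Two slices with made-up effective search spaces and one hit each.
    spaces = [1024.0, 1024.0]
    hits = [[("hit_a", 10.0)], [("hit_b", 11.0)]]
    for hit_id, score, e in merge_hits(hits, spaces):
        print(hit_id, score, e)
```

Because the bit-score does not change between slices, only the numerator needs to be recomputed; the per-slice E-values reported by each task are not used at all in this sketch.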
Please let me know if I am totally wrong, and give me some advice.
PS: Another question: knowing the database size and the query length, can we somehow advise the searcher what E-value cutoff to use in order to get at least one hit? This question came from my own practice, when I used a small E-value cutoff while searching a very big database with a short query sequence.
Unfortunately, BLAST+ from NCBI does not have an input option for the number of sequences in the original database.
I have modified the source code of BLAST+ 2.3.0 to add this option, so it is now possible to search fragments of a database.