Question

Why Is The Default Expect Threshold In Pblast 10?

4

12.8 years ago

Reyhaneh ▴ 530

Hi;

In pBlast, the Expect value is a threshold that limits which scores and alignments are reported as matches against the database sequences.

The default value is 10 (NCBI Document: link text), which means that 10 matches are expected to be found by chance for a given query under the stochastic model of Karlin and Altschul (Paper: link text).

Does anyone know why the default value was set to 10? What is the logic behind this value? Do you know of any papers discussing this issue?

Thanks in advance for your help.

blast • 6.0k views

updated 12.8 years ago by Larry_Parnell 16k • written 12.8 years ago by Reyhaneh ▴ 530

2

I don't believe that there's any logic to it - it's just an arbitrary choice. Anything higher than 1 is unlikely to be a related sequence. Perhaps users are uncomfortable when BLAST returns no matches, so the developers chose a value likely to return something, even if insignificant, under most scenarios :)

12.8 years ago by Neilfws 49k

score 10 · Answer 1 · 2012-03-02

The default E()-value (expect) for proteins for BLAST (and FASTA) reflects the goal of providing the investigator a chance to see the "transition" between related and unrelated sequences as you look down the list. While it is true that unrelated sequences begin to appear around E() < 1.0 (in 1% of searches, they should appear at E() < 0.01), for diverse protein families, there will be many related sequences with E()-values in this range as well. Indeed, for very large and diverse protein families, there will be many more homologs with E() between 1 and 10 than unrelated sequences. By E() ~ 10, however, many more of the scores will be unrelated.
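To make the "1% of searches" figure concrete: under the Karlin-Altschul model, the number of chance hits at or below a given E()-value is approximately Poisson distributed with mean E, so the probability of seeing at least one such hit is 1 - exp(-E). The following is only a small sketch that evaluates that formula; the function name is made up for illustration.

```python
import math

def prob_at_least_one_chance_hit(evalue: float) -> float:
    """Poisson approximation: P(at least one chance hit) = 1 - exp(-E)."""
    # expm1 keeps precision for very small E-values
    return -math.expm1(-evalue)

for e in (0.01, 1.0, 10.0):
    p = prob_at_least_one_chance_hit(e)
    print(f"E() = {e:>5}: P(at least one chance hit) = {p:.4f}")
```

At E() = 0.01 the probability is about 1%, at E() = 1 it is about 63%, and by E() = 10 a chance hit is nearly certain, which is consistent with the list being dominated by unrelated sequences near the default cutoff.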

Note that E() ~ 10 makes sense for protein:protein scores, but it makes less sense for translated-DNA:protein searches (BLASTX, FASTX) or DNA:DNA scores (BLASTN, FASTA). In the FASTA programs, the default values are 5.0 for FASTX/FASTY and 2.0 for FASTA/DNA. (BLAST always uses 10.) This reflects the less robust accuracy of those expect values. Because of out-of-frame translations (which can produce low E()-values against low complexity regions) and local DNA composition bias, more scores at E() < 5.0 or E() < 2.0 (DNA:DNA) are likely to be unrelated.
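As a rough illustration of applying a different cutoff per search type, here is a sketch that filters BLAST tabular output (`-outfmt 6`, where the E-value is the 11th column). The threshold table reuses the FASTA defaults quoted above; the dictionary keys, the function name, and the sample hit lines are all invented for the example.

```python
# FASTA-package defaults quoted above; the keys are illustrative labels only.
DEFAULT_EXPECT = {
    "protein:protein": 10.0,
    "translated-DNA:protein": 5.0,
    "DNA:DNA": 2.0,
}

def filter_hits(lines, search_type, cutoff=None):
    """Yield tab-separated hit lines whose E-value (column 11) is <= cutoff."""
    if cutoff is None:
        cutoff = DEFAULT_EXPECT[search_type]
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        if float(fields[10]) <= cutoff:
            yield line

# Made-up hits in outfmt 6 column order:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
sample = [
    "q1\ts1\t35.2\t120\t70\t4\t1\t118\t5\t122\t3e-12\t65.1",
    "q1\ts2\t28.0\t80\t52\t3\t10\t88\t40\t118\t3.5\t30.4",
    "q1\ts3\t25.0\t60\t41\t2\t20\t78\t7\t66\t8.1\t28.9",
]

for search_type in DEFAULT_EXPECT:
    kept = list(filter_hits(sample, search_type))
    print(f"{search_type}: {len(kept)} hit(s) kept")
```

Passing an explicit `cutoff` (for example 1.0) overrides the table, which is handy if you want a stricter list than the defaults give you.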

score 1 · Answer 2 · 2012-02-28

1

12.8 years ago

Larry_Parnell 16k

My understanding is that this is partly arbitrary, as Neil suggests, and partly historical. It has been that way for a long time, at least 20 years. "It's been that way for so long, likely no one really knows why."

Added in edit 2 Mar 2012: Given the much better response supplied by Bill Pearson, I should simply delete my response. There is a historical aspect to why a value of 10 is applied, but those who developed the similarity search tools we are still using do know why. As to the arbitrary part of my answer - that is wrong.

12.8 years ago by Larry_Parnell 16k

0

Thanks for this. It makes more sense to work with sequences with an E-value less than 1, then.

12.8 years ago by Reyhaneh ▴ 530