Why does BLAST use E-values instead of p-values?
9.6 years ago
mangfu100 ▴ 810

Hi all.

I think the p-value is one of the best ways to measure how significant observed data are.

However, BLAST reports E-values rather than p-values.

Why does BLAST use E-values instead of p-values to interpret sequence data?

Is there a logical reason for BLAST to use E-values? If so, could you explain it in detail?

sequencing alignment • 17k views
9.6 years ago

Quote from the BLAST help (http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html#head4):

"The BLAST programs report E-value rather than P-values because it is easier to understand the difference between, for example, E-value of 5 and 10 than P-values of 0.993 and 0.99995. However, when E < 0.01, P-values and E-value are nearly identical."
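The figures in the quote follow from the usual Poisson approximation: if the number of chance hits scoring at least S has mean E, then the probability of at least one such hit is P = 1 - exp(-E). A minimal Python sketch of that conversion (the function name `evalue_to_pvalue` is made up for illustration and is not part of any BLAST tool):

```python
import math

def evalue_to_pvalue(evalue: float) -> float:
    """P(at least one chance hit), assuming the hit count is Poisson with mean E."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-evalue)

for e in (10, 5, 0.01, 1e-5):
    print(f"E = {e:g}  ->  P = {evalue_to_pvalue(e):.6g}")
```

For E = 5 and E = 10 this gives roughly 0.993 and 0.99995, the values in the quote, and for E = 0.01 it gives about 0.00995, which is why the two numbers are nearly identical once E is small.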

It is important to note that a BLAST P-value is not the same thing as the P-value of a t-test.


Could you elaborate further on your last sentence?


Any p-value is the result of a hypothesis test. Since a BLAST search is not a hypothesis test, a p-value would be an inappropriate result.


Yes, BLAST is doing a hypothesis test: is the sequence a homolog of your query, or not? The null hypothesis is that it is not a homolog, and instead is a "random" sequence. The P-value is the probability that you would've gotten a score this high if it's not a homolog. BLAST scores follow a known distribution (an extreme value distribution) under the null hypothesis. Conceptually, it's the same as any other p-value based significance test.
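For reference, the extreme value distribution mentioned here is usually written in Karlin-Altschul form: for a given query length and database length, the expected number of chance hits scoring at least S is E = K * m * n * exp(-lambda * S), and the bit score S' = (lambda * S - ln K) / ln 2 gives the equivalent E = m * n * 2^(-S'). The sketch below is only illustrative: the function names are invented, and the `LAMBDA` and `K` values are assumed placeholders (roughly those commonly quoted for ungapped BLOSUM62), not numbers read from a real BLAST run.

```python
import math

# Assumed, illustrative Karlin-Altschul parameters (not taken from a real run).
LAMBDA = 0.318
K = 0.134

def bit_score(raw_score: float, lam: float = LAMBDA, k: float = K) -> float:
    """Normalize a raw score so it no longer depends on the scoring system."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def expected_hits(raw_score: float, query_len: int, db_len: int,
                  lam: float = LAMBDA, k: float = K) -> float:
    """E-value: expected number of chance hits with score >= raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)

def p_value(raw_score: float, query_len: int, db_len: int) -> float:
    """Probability of at least one chance hit with score >= raw_score."""
    return -math.expm1(-expected_hits(raw_score, query_len, db_len))

query_len, db_len = 250, 50_000_000  # hypothetical search space
for s in (40, 60, 80, 100):
    e = expected_hits(s, query_len, db_len)
    p = p_value(s, query_len, db_len)
    print(f"S={s:3d}  bits={bit_score(s):6.1f}  E={e:.3g}  P={p:.3g}")
```

The null hypothesis enters only through lambda and K; once those are fixed, a score maps to an E-value and a P-value in the same way as a test statistic maps to a tail probability in any other significance test.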


I think most users aren't aware of the hypothesis test as you've stated it. Implicitly, BLAST is testing a query sequence against thousands, or even billions, of candidate sequences. If we interpret the p-value as the false positive rate (that is, the rate of incorrectly rejecting the null hypothesis), then we should apply a multiple-testing correction to the result, and the copious results are decimated. The chance of an artificial alignment is highly dependent upon the genome being searched and the complexity of the query sequence. We can guarantee that a 2-mer will match a million locations, but that is useless as a result. The E-value accounts for these things and is more directly related to the complexity and uniqueness of a BLAST 'hit'. It's determined by the genome index being queried. We use BLAST to find things, and want to know how certain the result is. I think most users aren't specifying any hypotheses or accounting for their multiplicity.
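The multiple-testing point can be made concrete with a simplified sketch. Assume each of N database sequences is compared independently and a given score has per-comparison p-value p; then the expected number of chance hits is E = N * p (the same quantity a Bonferroni correction works with), and the chance of at least one is 1 - (1 - p)^N. This is a simplification for illustration only: BLAST itself works with an effective search space rather than a sequence count, and the names and numbers below are hypothetical.

```python
import math

def expected_false_hits(p_single: float, n_comparisons: int) -> float:
    """Expected number of chance hits across all comparisons (an E-value)."""
    return p_single * n_comparisons

def prob_any_false_hit(p_single: float, n_comparisons: int) -> float:
    """Probability of at least one chance hit, assuming independent comparisons."""
    # 1 - (1 - p)^N, computed via log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(n_comparisons * math.log1p(-p_single))

p_single = 1e-6  # hypothetical per-comparison p-value for one alignment score
for n in (1_000, 1_000_000, 1_000_000_000):
    e = expected_false_hits(p_single, n)
    print(f"N={n:>13,}  E={e:g}  P(any)={prob_any_false_hit(p_single, n):.4g}")
```

With these numbers the same per-comparison p-value gives E = 0.001 against a thousand sequences but E = 1000 against a billion, which is why a raw p-value alone says little about a database hit.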


seanrobertseddy (Sean Eddy? Hello!) is right here. BLAST is doing a standard hypothesis test. It has an explicit null model and the E-value is estimated based on this model. You may argue whether the null model is appropriate, but math is math. As I remember, BLAST precomputes the two key parameters. FASTA/swat learns the parameters from data. They are less affected by the redundancy in the database.


Exactly. BLAST P-value: "The probability of a chance alignment occurring with a particular score or a better score in a database search." Quoted from the BLAST Glossary.

More precisely: if you BLAST a query of length n against a database of length m and get a hit with score S, then the P-value is the probability of getting at least one hit with a score greater than or equal to S when you BLAST a random query of length n against a random database of length m.

This last statement is derived mostly from: http://www.basiclocalalignmentsearchtool.com/

However, BLAST reports the E-value, not the P-value, and the two are not equal. The BLAST E-value is the expected number of hits with a score greater than or equal to S when you BLAST a random query of length n against a random database of length m.


I agree. The last statement requires further elaboration; otherwise it might be misleading. Did you mean to say that the underlying distribution is different?
