Question

Blast E-Value To Database Size

2

13.2 years ago

Hranjeev ★ 1.5k

Hi,

I'm thinking of splitting the database into smaller chunks and BLASTing my sequences against each chunk in a separate process. My only concern is the results (which I will merge later).

Would the resulting e-values be affected by the database content when smaller subsets are used? I have a hunch that it would not matter once all the subset results are concatenated. Please correct me if I'm wrong.

blast statistics • 16k views

Updated 13.2 years ago by Neilfws 49k • written 13.2 years ago by Hranjeev ★ 1.5k

score 12 · Answer 1 · 2012-02-13

12

13.2 years ago

Neilfws 49k

The statistics of BLAST scores are described in this article. It's quite mathematics-heavy, but also quite readable; just take your time and re-read several times.

The short answer is that yes, e-values depend on database size. For a given alignment score, the e-value scales in proportion to the size of the search space. Intuitively, there's a higher probability of finding a match by chance in a large database than in a small one.
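To make the dependence concrete: in the Karlin-Altschul statistics that BLAST uses, the expected number of chance hits with bit score S' is E = m * n * 2^(-S'), where m is the query length and n is the database length in letters. The sketch below is only an illustration; the function name and the numbers are made up, and edge-effect corrections are ignored.

```python
def evalue_from_bit_score(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance hits: E = m * n * 2**(-S').

    Ignores the edge-effect (length adjustment) correction that BLAST applies.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Same hit (bit score 50, 300-letter query), searched against a
# hypothetical full database and against one tenth of it.
full = evalue_from_bit_score(50.0, query_len=300, db_len=1_000_000_000)
chunk = evalue_from_bit_score(50.0, query_len=300, db_len=100_000_000)

# The e-value reported against the chunk is ten times smaller.
print(f"full: {full:.3g}  chunk: {chunk:.3g}  ratio: {full / chunk:.1f}")
```

So the same alignment looks ten times more significant when it is found in a chunk that holds one tenth of the letters.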

That said, it is possible to re-calculate e-values by combining the results when the database is split. This is implemented in, for example, mpiBLAST. It would be a good idea to study their website, code and publication to see how they handle the problem.

See also the discussion of recalculating e-value in this paper or do a quick web search for "BLAST split database e-value calculate" - it's quite a widely-discussed issue.
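As a rough illustration of what such a recalculation can look like, the sketch below rescales each chunk's e-values by the ratio of the full database length to the chunk length before merging. This is a simple linear rescaling, not mpiBLAST's actual method; it assumes e-values are proportional to database length and ignores edge-effect corrections. The `Hit` tuple layout and the chunk names are hypothetical.

```python
from typing import Dict, List, Tuple

# (query_id, subject_id, evalue) as parsed from each chunk's results
Hit = Tuple[str, str, float]


def merge_chunk_hits(
    chunk_hits: Dict[str, List[Hit]],
    chunk_letters: Dict[str, int],
) -> List[Hit]:
    """Rescale per-chunk e-values to the full database size and merge them."""
    total_letters = sum(chunk_letters.values())
    merged: List[Hit] = []
    for chunk_name, hits in chunk_hits.items():
        # E is proportional to database length, so scale up by total / chunk.
        scale = total_letters / chunk_letters[chunk_name]
        for query_id, subject_id, evalue in hits:
            merged.append((query_id, subject_id, evalue * scale))
    # Sort by query, then by ascending (rescaled) e-value.
    merged.sort(key=lambda hit: (hit[0], hit[2]))
    return merged


hits = {
    "chunk_a": [("q1", "s1", 1e-5)],
    "chunk_b": [("q1", "s2", 3e-6)],
}
letters = {"chunk_a": 100, "chunk_b": 300}
for hit in merge_chunk_hits(hits, letters):
    print(hit)
```

Any e-value cutoff should be applied after rescaling; a hit that passes the threshold within a small chunk may no longer pass it once scaled to the full database.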

4

I totally agree with the answer, but you can set the database size manually using the "-z" parameter (in BLAST+, the equivalent option is "-dbsize"). That way, you can split the database file into smaller pieces, run your queries against each piece, and then merge the results.

13.2 years ago by scapella ▴ 390
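The suggestion above can be sketched as a small helper that builds one BLAST+ command per chunk while passing the size of the full database to every run. `-z` is the legacy `blastall` flag; BLAST+ programs call the same setting `-dbsize` and document it as the effective length of the database. The file names and chunk name here are hypothetical, and the letter count is simply the figure quoted elsewhere in this thread, not a verified value.

```python
import shlex
from typing import List


def blast_chunk_command(
    program: str,
    query: str,
    chunk_db: str,
    full_db_letters: int,
    out_path: str,
) -> List[str]:
    """Build a BLAST+ command that searches one chunk but computes
    e-values as if the whole database had been searched."""
    return [
        program,
        "-query", query,
        "-db", chunk_db,
        "-dbsize", str(full_db_letters),  # size of the FULL database
        "-outfmt", "6",                   # tabular output, easy to merge
        "-out", out_path,
    ]


cmd = blast_chunk_command(
    "blastp", "queries.fasta", "nr_chunk_01", 5784003470, "chunk_01.tsv"
)
print(shlex.join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)`, one call per chunk and process. Because every chunk is then scored against the same database size, the tabular outputs can be concatenated without rescaling.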

1

I think you also need to set the number of sequences in the database (N) so that the edge adjustment parameter (l, or "ell") is calculated correctly. The adjustment is done for you if you use NOBLAST.

12.3 years ago by colinDotAIBN ▴ 20
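To show why N matters: with the edge correction, the effective query length is m - l and the effective database length is n - N * l, so two chunks with the same number of letters but different numbers of sequences give different search spaces. The sketch below uses the first-order approximation l ≈ ln(K * m * n) / H; BLAST itself refines this value iteratively, so treat the result as an estimate. The K and H values are illustrative placeholders, not parameters read from a real run.

```python
import math


def approx_length_adjustment(m: int, n: int, K: float, H: float) -> float:
    """First-order estimate of the length adjustment l = ln(K * m * n) / H."""
    return math.log(K * m * n) / H


def effective_search_space(m: int, n: int, num_seqs: int, ell: float) -> float:
    """Effective search space (m - l) * (n - N * l), each factor floored at 1."""
    eff_query_len = max(m - ell, 1)
    eff_db_len = max(n - num_seqs * ell, 1)
    return eff_query_len * eff_db_len


# Illustrative numbers: 300-letter query, 1e9-letter database.
m, n = 300, 1_000_000_000
ell = approx_length_adjustment(m, n, K=0.041, H=0.14)

# Same total letters, different numbers of sequences.
few_long = effective_search_space(m, n, num_seqs=1_000_000, ell=ell)
many_short = effective_search_space(m, n, num_seqs=3_000_000, ell=ell)
print(f"l = {ell:.1f}, few/long = {few_long:.3g}, many/short = {many_short:.3g}")
```

A database made of many short sequences loses more letters to the edge correction, so its effective search space, and hence its e-values, come out smaller for the same raw size.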

1

To count the letters in a nucleotide FASTA file before building the database, try: grep -v '^>' something.fasta | grep -o '[ACGTNacgtn]' | wc -l

13.2 years ago by Biomonika (Noolean) 3.2k
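The grep pipeline above only matches nucleotide characters. For protein databases, or to get the number of sequences in the same pass, a small script along these lines may be more convenient. The function name is made up for illustration, and the sketch counts every character on a sequence line, including ambiguity codes and gap symbols.

```python
from typing import Iterable, Tuple


def count_fasta(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of sequences, total letters) for FASTA-formatted lines."""
    num_seqs = 0
    num_letters = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            num_seqs += 1  # header line starts a new record
        else:
            num_letters += len(line)  # every character on a sequence line
    return num_seqs, num_letters


example = [">seq1", "ACGTN", "acg", ">seq2", "MKV"]
print(count_fasta(example))  # (2, 11)
```

For a real file, pass an open file handle, for example `with open("something.fasta") as fh: print(count_fasta(fh))`. For a database that is already formatted, `blastdbcmd -db <name> -info` should report the same two totals without re-reading the FASTA file.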

0

Thanks, much appreciated.

13.2 years ago by Sequer ▴ 150

0

Oops, my hunch was wrong. Is there an easy way to count the total number of letters (N) in a database?

13.2 years ago by Hranjeev ★ 1.5k

0

Can anyone confirm that the nr database is currently 5784003470 letters in size?

13.2 years ago by Hranjeev ★ 1.5k

0

When I said "database size", I was referring to the total number of sequences in your database, not the total number of residues in it.

13.2 years ago by scapella ▴ 390