Combining stats of multiple BLAST queries
10.3 years ago
Vivek ▴ 50

Hi all,

I have a FASTA file with, let's say, the following structure:

>Query_1
actgacgac....
>Query_2
gtacgatcagct...

I want to BLAST this FASTA file against a set of databases one by one and combine the results. Combining the alignments in the results is relatively easy, but I wanted to know whether the individual search statistics can be combined too, so that the end result is the same as if the file had been BLASTed against all of the databases at once.

I know it might sound stupid to BLAST one by one when we can BLAST against all databases at once, but I'm just curious about the algorithm and its intricacies.

Thanks,

blast sequence alignment • 3.7k views
10.3 years ago
Michael 55k

I think that approach can be simplified somewhat. At the moment, I cannot think of any statistic other than the E-value that is influenced by the effective database size. That means that searching each database individually or searching a compound database should give the same results except for the E-value, and the E-value from the compound database would be the correct one for this search. If you agree, you could use blastdb_aliastool to make a compound database, and the E-value will then correctly reflect the database size. What do you think?

Or you could use the parameter:

-dbsize <Int8>
   Effective length of the database

And set it to the size of the compound database for each individual database search. That might save some memory.
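
The two options above can be sketched as command lines. This is a minimal sketch that only builds the argument lists and does not run anything. The database names, query file name, and output file names are hypothetical. The flags used (`-dblist`, `-dbtype`, `-out`, `-title` for blastdb_aliastool; `-query`, `-db`, `-dbsize`, `-outfmt` for blastn) are the standard BLAST+ ones, but check them against your installed version.

```python
import shlex


def alias_db_command(db_paths, out_name, dbtype="nucl"):
    """Build a blastdb_aliastool command grouping existing databases under one alias."""
    return [
        "blastdb_aliastool",
        "-dblist", " ".join(db_paths),  # space-separated list passed as one argument
        "-dbtype", dbtype,
        "-out", out_name,
        "-title", out_name,
    ]


def per_db_commands(query, db_paths, total_letters, program="blastn"):
    """Build one search per database, each with -dbsize forced to the combined size."""
    return [
        [program, "-query", query, "-db", db, "-dbsize", str(total_letters),
         "-outfmt", "6", "-out", f"{db}.tsv"]
        for db in db_paths
    ]


if __name__ == "__main__":
    dbs = ["db_a", "db_b"]  # hypothetical database names
    print(shlex.join(alias_db_command(dbs, "combined")))
    for cmd in per_db_commands("queries.fasta", dbs, total_letters=3_500_000):
        print(shlex.join(cmd))
```

The alias route creates only a small alias file pointing at the existing databases, so nothing is copied or reformatted.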


For the -dbsize <Int8> parameter, wouldn't I have to know the size of the compound database beforehand? And for the first part, where you suggest using blastdb_aliastool, won't that incur the overhead of another command execution?


He was suggesting two different options: force a specific database size, or make one large database.

The size of the compound database should be the sum of the sizes of the individual databases.
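
As a small illustration of that sum, here is a sketch in Python. The database names and letter counts are made up; in practice you would fill them in from whatever tool reports your database lengths.

```python
def combined_db_size(letters_per_db):
    """Return the total number of letters (bases or residues) across all databases."""
    return sum(letters_per_db.values())


# Hypothetical per-database lengths, in letters.
sizes = {"db_a": 1_200_000, "db_b": 2_300_000}
total = combined_db_size(sizes)
print(total)  # 3500000
```

The resulting total is the value that would be passed to -dbsize for every individual search.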

10.3 years ago
pld 5.1k

The -dbsize option lets you set the effective size of the database being searched. This value only comes into play when calculating the result statistics. However, if you change this size, your E-values no longer reflect the database you actually searched: you could over- or underestimate the significance of a hit by forcing a specific database size.

You really only have two options:

  1. Search each database individually and accept that, because you've used different databases, the expect values aren't directly comparable.
  2. Combine all of your databases into a single database and search that.

Otherwise, you could ignore the E-value and base your analysis on other metrics (bit score, identity). If you are only filtering hits by a threshold, say keeping hits with expect values under some cutoff, I do not think there will be a problem. If you need to calculate distances between a query and multiple sequences (in different databases), or rank hits from different databases, you should make one large database or not use BLAST.
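
If you do go the per-database route and rank on bit score, the merge could look roughly like the following sketch. It assumes each search was written in the default 12-column tabular format (-outfmt 6). The file contents, database names, and subject IDs in the example are invented. Each hit is tagged with the database it came from, so the origin of every hit stays known after merging.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def read_hits(handle, db_name):
    """Parse tabular BLAST output and tag every hit with its source database."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        hit["database"] = db_name
        hits.append(hit)
    return hits


def rank_hits(hits_by_db):
    """Merge hits from several databases; sort by query, then by descending bit score."""
    merged = [hit for hits in hits_by_db for hit in hits]
    return sorted(merged, key=lambda h: (h["qseqid"], -h["bitscore"]))


# Invented example rows standing in for two per-database result files.
out_a = io.StringIO("Query_1\tsubjA\t98.0\t100\t2\t0\t1\t100\t1\t100\t1e-45\t180\n")
out_b = io.StringIO("Query_1\tsubjB\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-48\t190\n")
ranked = rank_hits([read_hits(out_a, "db_a"), read_hits(out_b, "db_b")])
for hit in ranked:
    print(hit["qseqid"], hit["sseqid"], hit["database"], hit["bitscore"])
```

Sorting on bit score rather than E-value is the point here: the bit score does not depend on the size of the database that was searched.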


@joe.cronish826 Hi, I went with the -dbsize approach. Using blastdbcmd -list_format %l, I first obtained the lengths of all the databases, summed them, and passed the total to -dbsize. This gave E-values close to, but not exactly matching, those from the combined search. Is this approach correct? Why would the effective search space be a little different? Also, I noticed that the L, K, and H values didn't change. Could you confirm this intuition of mine?


The expect values are not "correct". The expect value is calculated as:

E = K * m * n * e^(-Lambda * S)

http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

where n is the length of the query sequence, m is the database size, S is the score of the query against a hit, and K and Lambda are constants reported by BLAST. In other words, E is the number of hits with a score of at least S that you would expect to find by random chance in the database being searched.

The search space is the product of the query sequence length and the database size. By setting -dbsize, you are replacing the true value of m and therefore generating an incorrect value for the search space. You can't really call your resulting expect values "exact" because, by definition, the expect value is in part a function of the size of the database searched. The key point is that, unless the databases are all the same size, expect values for hits from different databases are not directly comparable.
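
To make the dependence on database size concrete, here is a minimal sketch of the Karlin-Altschul form E = K * m * n * exp(-Lambda * S) from the linked tutorial. The K and Lambda values, the score, and the lengths are illustrative placeholders, not values taken from any real search. The sketch uses raw lengths, whereas BLAST applies a length adjustment to get effective lengths, so real output will differ slightly from this arithmetic.

```python
import math


def expect_value(score, query_len, db_len, k, lam):
    """E = K * m * n * exp(-Lambda * S), with m = db_len and n = query_len."""
    return k * db_len * query_len * math.exp(-lam * score)


def rescale_evalue(evalue, searched_db_len, target_db_len):
    """Rescale an E-value to a different database size (E is linear in m)."""
    return evalue * target_db_len / searched_db_len


# Illustrative placeholder parameters (assumptions, not from a real run).
K, LAM = 0.46, 1.28
e_small = expect_value(score=30, query_len=100, db_len=1_000_000, k=K, lam=LAM)
e_large = expect_value(score=30, query_len=100, db_len=3_500_000, k=K, lam=LAM)

# The same hit looks 3.5 times less significant in the 3.5 times larger database.
print(e_large / e_small)
print(math.isclose(rescale_evalue(e_small, 1_000_000, 3_500_000), e_large))
```

Because E is linear in m, forcing -dbsize to the combined size multiplies each per-database E-value by the ratio of the combined size to that database's own size, and the score S itself is unchanged.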

The other parameters probably did not change because, as Michael and I have said, the expect value is the only metric calculated by BLAST that is a function of the search space.

If your goal is to compare the distance of some query sequence to multiple hits in other databases, you should use a different metric. If your goal is to find the best match for your query given some set of sequences, you should put them all in the same database.


What I want to do is determine which hit came from which database. The idea was that, since I would be BLASTing against the databases one by one, I would know which hits came from which database. But if this doesn't give correct results, I can't go ahead with it. What do you suggest?


I'm not sure what you mean by "which hit comes from which database". The sequences in a BLAST database are known, or can be listed with blastdbcmd.
