Different results with BLAST depending on whether the subject is a formatted database or a FASTA file
9.7 years ago
driliwyr ▴ 10

I want to BLAST one protein sequence Q6GZX4.fasta against all the sequences in the file part0.fasta (FASTA format) containing 5000 sequences. First I tried using the part0.fasta directly as subject. Then I tried using a formatted database version of it (makeblastdb -in part0.fasta -title part0 -dbtype prot -out part0 -parse_seqids).

Using -query Q6GZX4.fasta and -subject part0.fasta (case A) [output is line count]:

user% blastp -query Q6GZX4.fasta -subject part0.fasta -evalue 100 -max_target_seqs 5000 -max_hsps 1 -outfmt 6|wc -l                     
4572

Using -query Q6GZX4.fasta and -db part0 (case B) [output is line count]:

user% blastp -query Q6GZX4.fasta -db part0 -evalue 100 -max_target_seqs 5000 -max_hsps 1 -outfmt 6|wc -l
43

Why do I get different results? 4572 hits in case A, but 43 in case B?

blast • 3.7k views

What happens when you run blastdbcmd -info -db part0?

user% blastdbcmd -info -db part0
Database: part0
        5,000 sequences; 1,826,734 total residues

Date: Mar 31, 2015  5:44 PM     Longest sequence: 5,058 residues

Volumes:
        /home/user/part0
9.7 years ago
rtliu ★ 2.2k

blastp with the -subject parameter is really a pairwise comparison of two sequences, so it is likely to be more sensitive than blastp with the -db parameter: a database can be very large, and the software has to balance speed against sensitivity.


Is there any documentation of this behaviour? Can I increase the sensitivity in the database case, so that I can use multiple threads with -num_threads and still keep the sensitivity?


I can't think of any relevant documentation. For the best answer, you could email blast-help.


I was looking for the same thing when this was first posted, but wasn't able to find anything about it.

OP, in your results do you see anything different about the search space?

The best clue I can find to what is going on is a comment in blast_setup.c:

    /* If database (subject) length is not available at this stage, and
     * overriding value of effective search space is not provided by user,
     * do nothing.
     * This situation can occur in the initial set up for a non-database search,
     * where each subject is treated as an individual database. 
     */
    if (db_length == 0 &&
        !BlastEffectiveLengthsOptions_IsSearchSpaceSet(eff_len_options)) {
        return 0;
    }

So I think that in the case of using -subject, each sequence in the file provided is treated as a separate blast database. I could be completely wrong.
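If that reading is right, the difference in hit counts would follow from how the E-value scales with the search space. Here is a rough sketch of the Karlin-Altschul formula E = K * m * n * exp(-lambda * S), evaluated once with a single average-length subject as the "database" and once with all 1,826,734 residues. The lambda and K values are assumed (the usual gapped BLOSUM62 figures), the query length and raw score are hypothetical, and BLAST's length adjustment is ignored, so treat the numbers as illustrative only.

```python
import math

# Assumed Karlin-Altschul parameters (gapped BLOSUM62, gap open 11 / extend 1).
LAMBDA = 0.267
K = 0.041


def evalue(raw_score, query_len, subject_len):
    """E = K * m * n * exp(-lambda * S); no effective-length correction."""
    return K * query_len * subject_len * math.exp(-LAMBDA * raw_score)


QUERY_LEN = 256            # hypothetical query length
DB_RESIDUES = 1_826_734    # total residues reported by blastdbcmd -info
N_SEQS = 5_000             # sequences reported by blastdbcmd -info
MEAN_SUBJECT_LEN = DB_RESIDUES / N_SEQS

raw_score = 40             # hypothetical weak alignment

# Case A (as hypothesised): each subject is its own small database.
e_subject = evalue(raw_score, QUERY_LEN, MEAN_SUBJECT_LEN)
# Case B: the whole formatted database defines the search space.
e_db = evalue(raw_score, QUERY_LEN, DB_RESIDUES)

print(f"E-value against one subject: {e_subject:.3g}")
print(f"E-value against the full db: {e_db:.3g}")
print(f"ratio: {e_db / e_subject:.0f}")
```

Under these assumptions the same alignment gets an E-value several thousand times larger in the database search, so a weak hit that stays under -evalue 100 in case A can be filtered out in case B.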


It sounds reasonable that it treats each sequence as an individual database in case A. Thanks for taking the time.

Changing the -searchsp parameter increases the results in case B:

user% blastp -query Q6GZX4.fasta -db part0 -evalue 100 -max_target_seqs 5000 -max_hsps 1 -searchsp 1 -outfmt 6|wc -l
4643

What implications do the -searchsp (and -dbsize) parameters have in this case?
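One hedged way to think about the two options: in bit-score terms E = searchsp * 2^(-S'), so the E-value cutoff translates into a minimum bit score of log2(searchsp / cutoff). As far as I understand the BLAST+ help text, -dbsize overrides the effective database length and -searchsp overrides the whole effective search space. The sketch below assumes exactly that, uses a hypothetical query length, and ignores length adjustment; the `search_space` and `min_bit_score` helpers are illustrative, not part of BLAST.

```python
import math

QUERY_LEN = 256           # hypothetical query length
DB_RESIDUES = 1_826_734   # total residues reported by blastdbcmd -info
EVALUE_CUTOFF = 100       # the -evalue used in the thread


def search_space(query_len, db_len, dbsize=None, searchsp=None):
    """Assumed behaviour: -searchsp replaces the product m * n,
    -dbsize replaces only the database length n."""
    if searchsp is not None:
        return searchsp
    return query_len * (dbsize if dbsize is not None else db_len)


def min_bit_score(space, cutoff):
    """Smallest bit score S' for which space * 2**(-S') <= cutoff."""
    return math.log2(space / cutoff)


default = min_bit_score(search_space(QUERY_LEN, DB_RESIDUES), EVALUE_CUTOFF)
small_db = min_bit_score(search_space(QUERY_LEN, DB_RESIDUES, dbsize=365), EVALUE_CUTOFF)
forced = min_bit_score(search_space(QUERY_LEN, DB_RESIDUES, searchsp=1), EVALUE_CUTOFF)

print(f"default search space: need at least {default:.1f} bits")
print(f"-dbsize 365:          need at least {small_db:.1f} bits")
print(f"-searchsp 1:          need at least {forced:.1f} bits")
```

With -searchsp 1 the threshold comes out negative, meaning the E-value filter no longer rejects anything and the reported E-values lose their usual statistical meaning. Passing a -dbsize close to a typical subject length would be a gentler way to approximate the per-subject behaviour, if the assumption above holds.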
