Question

Blastn Gives Different Results Based on Database Size

0

6.5 years ago

khv • 0

I'm trying to figure out why I am seeing different blast results based on database size. Here's what I'm trying to accomplish:

I have a thousand or so sequences of approximately 2000 bp in length. I want to subdivide the sequences into groups that don't share more than 13 bp of homology to other sequences in the group. I want to use blast to identify homologies between sequences.

First I create a local database from a fasta file of all sequences. Then I run a blastn query of the fasta file against the database.

makeblastdb -in seqlist.fasta -dbtype 'nucl' -out seqlist
blastn -task blastn -db seqlist -query seqlist.fasta -word_size 13 -out results.txt -outfmt 6

In the resulting output, the smallest region of homology found is 15 bp. Based on the blast result, I make subgroups of 20 or so sequences that do not share homology. To check my work, I repeat the blast process using only a single subgroup.

makeblastdb -in seqlist2.fasta -dbtype 'nucl' -out seqlist2
blastn -db seqlist2 -query seqlist2.fasta -word_size 13 -out results2.txt -outfmt 6

Now when I run the same query on a database of 20 or so sequences instead of 1000 or so sequences, the blast output finds all sorts of regions of homology of length 13 and 14 bp. I'm trying to understand why these outputs did not appear in the original blast query. Does blast use a different algorithm based on database size? Is there a parameter I can pass to change this?
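To compare the two result files directly, the shortest reported alignment can be pulled out of each one. This is a minimal sketch, assuming the default `-outfmt 6` column order (query ID in column 1, subject ID in column 2, alignment length in column 4); the helper name `min_aln_len` is hypothetical, and self-hits are skipped because every sequence matches itself over its full length.

```shell
# Print the shortest alignment length among non-self hits in a BLAST
# tabular (-outfmt 6) file. Assumes the default column order:
# col 1 = query ID, col 2 = subject ID, col 4 = alignment length.
min_aln_len() {
  awk -F'\t' '
    $1 != $2 && (min == "" || $4 + 0 < min) { min = $4 + 0 }
    END { if (min != "") print min }
  ' "$1"
}

# Example usage with the files from the commands above:
#   min_aln_len results.txt
#   min_aln_len results2.txt
```

If the two numbers differ, the short hits are being dropped from the larger search rather than missed by the grouping step.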

Per some forum posts I have found on using blastn to search for short alignments, I have tried including parameters like

-dust no
-soft_masking false
-task blastn-short

None of these parameters gets the large-database query to report the 13 bp regions of homology found in the small-database query. Reducing the word size doesn't help either. Any advice or information on this would be appreciated.

alignment sequence blast • 2.2k views

ADD COMMENT • link updated 6.5 years ago by lieven.sterck 15k • written 6.5 years ago by khv • 0

0

An initial comment (on a point I'm quite sensitive about):

I want to subdivide the sequences into groups that don't share more than 13 bp of homology to other sequences in the group. I want to use blast to identify homologies between sequences.

What you are looking for is similarity! You cannot have 13 bp of homology!

Homology is a 'boolean' thing (= yes or no); there is no such thing as more or less homology, or a percentage of homology.

ADD REPLY • link 6.5 years ago by lieven.sterck 15k

score 3 · Answer 1 · 2018-06-18

3

6.5 years ago

lieven.sterck 15k

This all makes perfect sense (except the homology part, cf. my comment above ;)

The database size influences the statistical significance assigned to each HSP, most of all the E-value calculation. It is very likely that running BLAST against the small DB gives more (or different) hits than running it against the big one, especially since you use the same (default) E-value threshold in both runs.
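The effect can be illustrated with the standard approximation E = m * n * 2^(-S'), where m is the query length, n the total database length and S' the bit score of the hit. The sketch below is only an illustration: the bit score of 24.7 is an assumed value for a short exact match, not a number taken from this thread, the helper name `evalue` is hypothetical, and the calculation ignores the effective-length corrections BLAST applies.

```shell
# Approximate E-value: E = m * n * 2^(-S'), with m = query length,
# n = total database length and S' = bit score of the hit.
# BLAST's effective-length corrections are ignored here.
evalue() {
  # $1 = bit score, $2 = query length (bp), $3 = database length (bp)
  awk -v s="$1" -v m="$2" -v n="$3" \
    'BEGIN { printf "%.3g\n", m * n * exp(-s * log(2)) }'
}

# Illustrative bit score of 24.7 (an assumption) for a short exact match.
evalue 24.7 2000 2000000   # ~1000 sequences x 2000 bp: well above 10
evalue 24.7 2000 40000     # ~20 sequences x 2000 bp: below 10
```

With blastn's default E-value cutoff of 10, the same hit would be dropped in the large search and reported in the small one.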

Yes, there are parameters you can set to avoid this behavior, namely the following two:

-dbsize <Int8>
   Effective length of the database
-searchsp <Int8, >=0>
   Effective length of the search space

These fix the DB size, so you end up with the same scoring statistics regardless of the actual size of the DB. To set them, look at the output of the large-DB BLAST run where it reports the database size (blastdb-size or such) and use the same value for the small-DB run. Personally, I would also set the -ungapped parameter.
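A minimal sketch of how that could look, assuming the total number of bases in seqlist.fasta is an acceptable stand-in for the database size reported by the large run. The helper name `total_bases` is hypothetical, and `-task blastn` is kept from the first command in the question so that both runs use the same scoring.

```shell
# Count the total number of bases in a FASTA file
# (header lines starting with ">" are skipped).
total_bases() {
  awk '!/^>/ { n += length($0) } END { print n + 0 }' "$1"
}

# Run the small-DB search with the database size pinned to that of the
# large set. Only runs if blastn and the files from the question exist.
if command -v blastn >/dev/null 2>&1 \
   && [ -f seqlist.fasta ] && [ -f seqlist2.fasta ]; then
  DBSIZE=$(total_bases seqlist.fasta)
  blastn -task blastn -db seqlist2 -query seqlist2.fasta -word_size 13 \
    -dbsize "$DBSIZE" -out results2.txt -outfmt 6
fi
```

With the database size pinned, a given hit should get the same E-value in the subgroup search as in the full search, so the two result sets become comparable.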

ADD COMMENT • link 6.5 years ago by lieven.sterck 15k

0

Thanks for the reply; this solved the issue.

ADD REPLY • link 6.5 years ago by khv • 0

0

-max_target_seqs (default 500) can also play a role here.

ADD REPLY • link 6.5 years ago by 5heikki 11k