Question

blastn much slower on longer subject sequences?

0

4.3 years ago

liorglic ★ 1.5k

Hello,
I encountered a strange issue running blastn. I use the same set of query sequences in two scenarios:
1. Running against a DB of genomic contigs
2. Running against the same contigs, after they have been scaffolded into pseudomolecules
The second run is about 100x slower!
I should note that the pseudomolecules are pretty large - this is a plant genome with chromosomes each over 600 Mbp.

Does this even make sense? Why would blast be slower when the sequences in the DB are longer? And is there any way I can improve performance?
I'd have just switched to BLAT or DIAMOND, but this Blast run is invoked by BUSCO, so I don't really have a choice. The command run by BUSCO looks like:
tblastn -evalue 0.001 -num_threads 40 -query ancestral.fasta -db scaffolds.fasta -out tblastn.tsv -outfmt 7

Thanks!

blast blastn • 1.4k views

ADD COMMENT • link updated 4.3 years ago by JC 13k • written 4.3 years ago by liorglic ★ 1.5k

0

Don't you think the size of the search space should scale up the search time?

ADD REPLY • link 4.3 years ago by karl.stamm 4.1k

0

That's exactly the point - the search space size is the same, it's only arranged into fewer but larger sequences. Or maybe I didn't understand what you tried to say...
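One way to check the "same search space" claim is to count bases in both FASTA files. The sketch below is only an illustration: the function name `fasta_base_counts` and the toy sequences are made up, and it assumes that scaffolding gaps are written as runs of `N`.

```python
import io

def fasta_base_counts(handle):
    """Return (total_bases, non_gap_bases, n_sequences) for a FASTA stream."""
    total = non_gap = n_seqs = 0
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: start of a new record.
            n_seqs += 1
            continue
        total += len(line)
        # Assumption: gap characters added by scaffolding are N/n.
        non_gap += len(line) - line.upper().count("N")
    return total, non_gap, n_seqs

# Toy data standing in for the two databases (hypothetical sequences).
contigs = io.StringIO(">c1\nACGTACGT\n>c2\nGGCC\n")
scaffold = io.StringIO(">chr1\nACGTACGTNNNNGGCC\n")
print(fasta_base_counts(contigs))   # (12, 12, 2)
print(fasta_base_counts(scaffold))  # (16, 12, 1)
```

On real data, pass `open("contigs.fasta")` and `open("scaffolds.fasta")` instead of the toy strings (the first file name is a placeholder). If the non-gap counts match, the two databases contain the same bases and differ only in how they are arranged.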

ADD REPLY • link 4.3 years ago by liorglic ★ 1.5k

1

I suppose the search space is much wider than you think. Consider searching two 10-base sequences vs one 20-base sequence. If you're doing a sliding window of 9 bases, the first (2x10) has just four locations (two per sequence), while the 1x20 'database' has twelve ways to place the 9-base query. That's three times more searching, and we haven't even considered reverse-complement, partial matches or split matches. Allowing for gaps in the match (TopHat style) will blow this factor up further. And I'm still not sure your two databases really have the same total base count: 'scaffolded into molecules' doesn't mean one-to-one, and the result could easily have duplications or repeats.
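The window arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch of the counting argument only, not of how BLAST actually seeds or extends hits; `window_placements` is a made-up helper name.

```python
def window_placements(seq_lengths, window):
    """Count the start positions where a window of the given size fits
    entirely inside one sequence, summed over all sequences."""
    return sum(max(0, n - window + 1) for n in seq_lengths)

# Two 10-base sequences vs one 20-base sequence, 9-base window.
print(window_placements([10, 10], 9))  # 4
print(window_placements([20], 9))      # 12
```

The same function can be applied to the real contig and pseudomolecule lengths to see how large the difference is at genome scale.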

ADD REPLY • link 4.3 years ago by karl.stamm 4.1k

0

I see. That indeed makes sense, but it's still a bit surprising how bad the effect is. Running ~1600 queries against a genome about 50% larger than the human genome is taking more than two days using 40 CPUs. I still suspect that there's something else going on there. I'll update if I ever find out what.

ADD REPLY • link 4.3 years ago by liorglic ★ 1.5k

score 2 · Answer 1 · 2020-08-13

2

4.3 years ago

JC 13k

My guess is that you are expanding the search space by using larger sequences, which can change the way blast identifies the initial hits and tries to extend them. It could also be memory: if you are on a memory-limited machine, loading larger sequences increases memory use, and your system could be swapping to disk to fit it.

ADD COMMENT • link 4.3 years ago by JC 13k

0

I don't think there's a memory issue - the job is limited to 100g RAM, and it's actually using 38g. In general, it seems like the machine is doing fine - lots of free RAM, no swapping, and in fact it looks like most of the time Blast is only using one CPU (?).
Regarding your other suggestion - is there anything I can do about it? Are there parameters that control it?

ADD REPLY • link 4.3 years ago by liorglic ★ 1.5k

1

On the CPU usage: the majority of the blast process is actually single-threaded; only a very small part is effectively multi-threaded.

ADD REPLY • link 4.3 years ago by lieven.sterck 15k