Hello,
I've encountered a strange issue running BLAST. I use the same set of query sequences in two scenarios:
1. Running against a DB of genomic contigs
2. Running against the same contigs, after they have been scaffolded into pseudomolecules
The second run is about 100x slower!
I should note that the pseudomolecules are pretty large - this is a plant genome with chromosomes each over 600 Mbp.
Does this even make sense? Why would BLAST be slower when the sequences in the DB are longer? And is there any way I can improve performance?
I would have just switched to a different tool like DIAMOND, but this BLAST run is invoked by BUSCO, so I don't really have a choice. The command run by BUSCO looks like:
tblastn -evalue 0.001 -num_threads 40 -query ancestral.fasta -db scaffolds.fasta -out tblastn.tsv -outfmt 7
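In case anyone wants to reproduce the comparison outside of BUSCO, something along these lines should do it (contigs.fasta and scaffolds.fasta are just placeholders for the unscaffolded and scaffolded assemblies; the tblastn flags are the ones BUSCO uses):

# build a nucleotide BLAST DB for each assembly
makeblastdb -in contigs.fasta -dbtype nucl -out contigs_db
makeblastdb -in scaffolds.fasta -dbtype nucl -out scaffolds_db

# run the same tblastn search once per DB and compare wall-clock time
time tblastn -evalue 0.001 -num_threads 40 -query ancestral.fasta -db contigs_db -out tblastn_contigs.tsv -outfmt 7
time tblastn -evalue 0.001 -num_threads 40 -query ancestral.fasta -db scaffolds_db -out tblastn_scaffolds.tsv -outfmt 7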
Thanks!
Don't you think the size of the search space should scale up the search time?
That's exactly the point: the total search space is the same size; it's just arranged into fewer, larger sequences. Or maybe I didn't understand what you were trying to say...
I suppose the search space is much wider than you think. Consider searching two 10-base sequences vs. one 20-base sequence. If you're sliding a 9-base window, the first case (2x10) has just four placements (left and right side of each), while the 1x20 'database' has twelve (20 - 9 + 1). That's three times more searching, and we haven't even considered reverse complements, partial matches, or split matches. Allowing for gaps in the match (TopHat-style) will blow this factor up further. And I'm still not sure your two databases really have the same total base count: 'scaffolded into pseudomolecules' doesn't mean one-to-one, and the result could easily contain duplications or repeats.
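To put rough numbers on that toy example (ignoring reverse complements, gaps, and partial matches), the number of placements of a k-base window in a sequence of length L is L - k + 1:

# window placements: L - k + 1
k=9
echo $(( 2 * (10 - k + 1) ))   # two 10-base sequences -> 2 * 2 = 4 placements
echo $(( 20 - k + 1 ))         # one 20-base sequence  -> 12 placements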
I see. That indeed makes sense, but it's still a bit surprising how bad the effect is. Running ~1600 queries against a genome about 50% larger than the human genome is taking more than two days using 40 CPUs. I still suspect that there's something else going on there. I'll update if I ever find out what.