Hi all,
I'm trying to blast around 10,000 protein sequences against nr with blastp. In the past, splitting the input into 100-sequence chunks and running each chunk on a single CPU worked well for blastn, but blastp seems to be much slower. A .fasta file with 100 sequences, running on a single core, has not yet produced any output after 55 minutes.
I have BLAST+ installed in an HPC environment, with the datasets downloaded and indexed appropriately. I have tried blasting only one sequence using 16 cores:
blastp -query sequence.fasta -db nr -out test -outfmt 7 -num_threads 16
and it took around 10 minutes. The same sequence takes about a minute to process on the blast web server. I know it should go faster (per sequence) if I blast multiple sequences at once. Is there a way I can figure out what the optimum ratio of # of sequences vs. # of cores would be (other than trial and error, I guess)? I have access to 1000 CPUs at once, so it would be nice to find a decent balance.
Also, why is the web server much faster? Does it bundle together multiple queries or something? Or does our local blast setup potentially suffer from disk I/O issues?
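For reference, the 100-sequence chunking described above could be done along these lines. This is only a sketch: the function name `split_fasta`, the file names, and the `chunk_0001.fasta` naming scheme are made up for illustration and are not part of BLAST+.

```shell
# Hypothetical helper: split a FASTA file into chunks of N sequences each.
# Usage: split_fasta INPUT.fasta SEQS_PER_CHUNK OUTPUT_PREFIX
split_fasta() {
    awk -v n="$2" -v prefix="$3" '
        /^>/ {
            # Open a new chunk file every n header lines.
            if (count % n == 0) {
                if (out) close(out)
                out = sprintf("%s_%04d.fasta", prefix, count / n + 1)
            }
            count++
        }
        # Write header and sequence lines to the current chunk.
        out { print > out }
    ' "$1"
}

# Example call (assumed file names):
#   split_fasta all_queries.fasta 100 chunk
# Each chunk_XXXX.fasta can then be submitted as its own single-core blastp job.
```

Each resulting chunk can be passed to the same blastp command as above with `-num_threads 1`.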
Thank you for the explanation! I managed to get diamond to work, but I'm having trouble getting it to run faster than blast. For a single test sequence, local blastp takes about 5 minutes, while diamond blastp took a little over 15 mins. Running the same sequence on the blast web server took 10 seconds or so. So I guess now the question becomes: how do I optimize the ratio of number of sequences per file and the number of CPUs for diamond...
I have a file with 100 sequences running with diamond, but I'll have to wait a lot longer to see how long it takes. Blastp was able to do ~40 sequences in 16 hours with a single CPU.
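Short of a formula, one rough way to compare settings is to time the same search at a few thread counts. The sketch below assumes a DIAMOND database called `nr` and a test file `test_100.fasta`; both names, and the helper functions `elapsed_seconds` and `sweep_threads`, are hypothetical.

```shell
# Hypothetical helper: run a command, discard its stdout, and print the
# elapsed wall-clock time in whole seconds. Returns the command's exit status.
elapsed_seconds() {
    local start end status
    start=$(date +%s)
    "$@" >/dev/null
    status=$?
    end=$(date +%s)
    echo $(( end - start ))
    return $status
}

# Hypothetical sweep: time one DIAMOND search per thread count and print
# "threads<TAB>seconds" for each. Usage: sweep_threads DB QUERY T1 [T2 ...]
sweep_threads() {
    local db=$1 query=$2 t secs
    shift 2
    for t in "$@"; do
        secs=$(elapsed_seconds diamond blastp --db "$db" --query "$query" \
            --out "hits_t${t}.tsv" --outfmt 6 --threads "$t")
        printf '%s\t%s\n' "$t" "$secs"
    done
}

# Example call (not run here; assumed file names):
#   sweep_threads nr test_100.fasta 1 4 16
```

Timings from a single run are noisy on a shared cluster, so this only gives a ballpark.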
Don't try to compare anything local with NCBI's web blast infrastructure, for you will always come up short :-)
Sounds like you are lucky to have reasonably adequate hardware. DIAMOND actually works well with a large number of sequences, so put all 10,000 in one query file.
There are some additional notes for distributed computing as well: https://github.com/bbuchfink/diamond/wiki/6.-Distributed-computing
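As a sketch of that single-run approach: the wrapper below sends all queries to DIAMOND in one job. The file names (`nr.faa`, `all_queries.fasta`, `hits.tsv`), the thread count, and the `run_diamond` wrapper with its `DRY_RUN` switch are assumptions for illustration; only the `diamond` options themselves come from DIAMOND's command line.

```shell
# One-time database build from the nr protein FASTA (assumed file name):
#   diamond makedb --in nr.faa --db nr

# Hypothetical wrapper: run one DIAMOND blastp job over a whole query file.
# Usage: run_diamond DB QUERY OUT THREADS
# With DRY_RUN=1 the command is printed instead of executed.
run_diamond() {
    local run=""
    if [ "${DRY_RUN:-0}" = "1" ]; then
        run="echo"
    fi
    # --outfmt 6 requests tabular output.
    $run diamond blastp --db "$1" --query "$2" --out "$3" \
        --outfmt 6 --threads "$4"
}

# Example call (assumed file names):
#   run_diamond nr all_queries.fasta hits.tsv 16
```

Setting `DRY_RUN=1` first is a cheap way to check the command before submitting it to the scheduler.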
Thank you for the info! It turned out that running all 10k at once is actually faster than running one or two at a time!