Hello,
I need to run blastp for one genome (15,000 sequences) against another genome (12,000 sequences) using Biopython. I decided to use local BLAST and query the genome 1 FASTA file against a genome 2 database (made with the makeblastdb command from the second genome's FASTA file). I managed to run the search with the default parameters of standalone blastp. However, when I change the word size to a BIGGER value (the default is 3 and I set it to 6), BLAST runs extremely slowly. I am confused about why this happens, because increasing the word size is supposed to make things go faster. Here is how I pass arguments to the NcbiblastpCommandline function:
NcbiblastpCommandline( word_size=6, query=queryInputPath, db=subjectInputPath, out=outputPath, outfmt=5 )()
Things are much faster when the call does not include the 'word_size=6' keyword argument. Without word_size=6 it takes around 1.5 h to perform the BLAST search. My Mac has 4 GB of RAM and a 1.6 GHz Intel Core i5 processor. What may be the cause?
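One way to narrow this down without waiting hours is to time both settings on a small subset of the queries. Below is a minimal sketch, assuming `blastp` is on the PATH. It calls the BLAST+ binary through `subprocess` rather than through Biopython, and the file names and the helper names `first_records`, `blastp_cmd` and `time_blastp` are hypothetical, not part of the original setup.

```python
import subprocess
import time


def first_records(fasta_text, n):
    """Return the first n FASTA records of fasta_text as one string."""
    kept, seen = [], 0
    for line in fasta_text.splitlines(keepends=True):
        if line.startswith(">"):
            seen += 1
            if seen > n:
                break
        if seen:  # skip anything before the first header
            kept.append(line)
    return "".join(kept)


def blastp_cmd(query, db, out, word_size=None):
    """Build a blastp argument list; word_size=None keeps the program default."""
    cmd = ["blastp", "-query", query, "-db", db, "-out", out, "-outfmt", "5"]
    if word_size is not None:
        cmd += ["-word_size", str(word_size)]
    return cmd


def time_blastp(cmd):
    """Run one blastp command and return the wall-clock seconds it took."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical paths: genome1.fasta holds the queries, genome2 is the
    # database name that was given to makeblastdb.
    with open("genome1.fasta") as handle:
        subset = first_records(handle.read(), 50)
    with open("subset.fasta", "w") as handle:
        handle.write(subset)
    for ws in (None, 6):
        seconds = time_blastp(blastp_cmd("subset.fasta", "genome2", "subset.xml", ws))
        print(f"word_size={ws}: {seconds:.1f} s")
```

If the slowdown already shows up on 50 queries, it is easier to experiment with other options on that subset than on the full 15,000 sequences.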
Check that you're not running out of memory.
With 4 GB of RAM, that is very likely.
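A rough way to check this from Python is to run the search as a child process and read its peak resident memory afterwards. This is only a sketch under some assumptions: it uses the standard-library `resource` module (Unix and macOS only), the helper name `peak_child_rss_mb` and the file names are hypothetical, and the figure is the largest peak among all child processes that this Python process has waited for so far, so it is best run in a fresh interpreter.

```python
import resource
import subprocess
import sys


def peak_child_rss_mb(cmd):
    """Run cmd to completion and return the peak resident memory, in MB,
    of the largest child process this interpreter has waited for so far."""
    subprocess.run(cmd, check=True)
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


if __name__ == "__main__":
    # Hypothetical file names; adjust to the actual query and database.
    cmd = ["blastp", "-query", "genome1.fasta", "-db", "genome2",
           "-out", "out.xml", "-outfmt", "5", "-word_size", "6"]
    print(f"peak memory: {peak_child_rss_mb(cmd):.0f} MB")
```

If the reported peak comes close to the 4 GB of physical RAM, the machine is probably swapping, which would explain a dramatic slowdown.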
You may be able to save some overhead if you run BLAST directly from the command line, although probably not a meaningful amount. You may also try splitting the database into multiple parts; just make sure you manually set the statistical options (e.g. dbsize) so that e-values stay comparable across the parts. You'll have to do some post-BLAST work to find the best hits, but this should get you around the memory issues.
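The split-database suggestion could look roughly like the sketch below. It assumes the second genome was split into FASTA parts that were each formatted with makeblastdb, that `-dbsize` is set to the total residue count of the full, unsplit database, and that the default tabular output (`-outfmt 6`, bit score in the 12th column) is used. The part names and the helper names `total_residues`, `blastp_part_cmd` and `best_hits` are hypothetical.

```python
import subprocess


def total_residues(fasta_lines):
    """Sum the sequence lengths of all records (header lines are skipped)."""
    return sum(len(line.strip()) for line in fasta_lines if not line.startswith(">"))


def blastp_part_cmd(query, db_part, out, dbsize):
    """Build a blastp call against one database part with a fixed -dbsize."""
    return ["blastp", "-query", query, "-db", db_part, "-out", out,
            "-outfmt", "6", "-dbsize", str(dbsize)]


def best_hits(tabular_lines):
    """Keep the highest-bit-score hit per query from default -outfmt 6 lines."""
    best = {}
    for line in tabular_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


if __name__ == "__main__":
    # Hypothetical file names: genome2.fasta is the unsplit subject genome,
    # genome2_part1/2 are databases built from its two halves.
    with open("genome2.fasta") as handle:
        dbsize = total_residues(handle)
    lines = []
    for part in ("genome2_part1", "genome2_part2"):
        out = part + ".tsv"
        subprocess.run(blastp_part_cmd("genome1.fasta", part, out, dbsize), check=True)
        with open(out) as handle:
            lines.extend(handle)
    for query, (subject, bitscore) in best_hits(lines).items():
        print(query, subject, bitscore, sep="\t")
```

Merging on bit score is one simple choice for the post-BLAST step; with a shared `-dbsize`, sorting by e-value should give a similar ranking.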
Hi Aleksander, Long shot, but did you ever figure out why increasing the word size slows down the search? I have the same problem with blastp version 2.11.0, and it does not look like I'm reaching any memory limit. Cheers, Henrietta