I have snippets of protein sequences and I need to find out to which accession numbers of the nr database they belong.
So far I tried to automate this process in the following way:
- accessing the NCBI webserver directly via the NCBIWWW function of the Bio.Blast module of Biopython
- accessing the BLAST+ program via the NcbiblastpCommandline function of the Bio.Blast.Applications module of Biopython and using the -remote argument
But both ways basically take forever. Do any of you have an idea how I can automate this without having to download the nr database of NCBI? Or is this really the only way?
I have around 600 snippets for a phylogenetic analysis of cyanobacterial proteins. Right now I'm running my code and so far it's taking 30 min and longer to create a single file. However, the first file was created pretty fast. It seems like a single run is relatively fast, but as soon as I chain many searches in a loop each of them takes really long.
This is my code
What parameters do you think could be changed to improve performance?
I guess otherwise I could just try downloading a subset of the nr database, though I'll have to figure out how.
Be sure not to overload the NCBI servers with your requests. As Joe also pointed out, if you submit too many concurrent requests you might get blacklisted by NCBI. You can avoid this to some extent by registering yourself at NCBI, but even then you won't be allowed to submit many requests.
The only thing you can do is try to adjust parameters so that your searches complete more quickly. Your E-value is very high for one thing (BLAST+ itself defaults to 10; a cutoff around 1E-6 is a more typical choice, I think), and you probably don't need to specify the word size or alignment parameters unless you have a very specific reason.
I think the NCBI polling rate is 5 queries per second or something, even for guests, so you could parallelise your code to run up to 5 concurrent searches, which will bring your overall run time down a fair bit, but only to a point.
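A minimal sketch of capping the number of simultaneous searches, using only the standard library. The limit of 5 is the figure guessed above, not a confirmed NCBI number, and `fake_search` is a hypothetical stand-in for whatever function actually submits one remote search.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 5  # assumption: the figure guessed above, not a confirmed NCBI limit


def run_all(queries, search, max_workers=MAX_CONCURRENT):
    """Run search(query) for every query, never more than max_workers at once.

    Results are returned in the same order as the input queries.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(search, queries))


def fake_search(seq):
    # Hypothetical stand-in for a real remote BLAST call; it only returns the length.
    return len(seq)


results = run_all(["MKV", "MSTAA", "MA"], fake_search)
print(results)
```

Threads suit this case because each worker would spend nearly all its time waiting on the network. Lowering `max_workers` is the safer direction if NCBI starts refusing requests.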
I am unfamiliar with using the python wrapper to blast+ command line. Regardless...
I question your use of a loop and expect you will get better performance overall by removing it.
If I were to call blast+ using -remote from the command line, I would typically not call it once for each input sequence but rather pass all input sequences in as a single multi-FASTA file and expect combined results in a single output file. You might expect improved performance since, typically, a large portion of BLAST runtime is spent loading the database's index into RAM, and running multiple queries runs the risk of (probably) having to load the same index repeatedly, which is a lot of redundant I/O. I say probably since the OS may cache, though I expect NCBI's servers are not caching nr (I expect this is untenable anyway given nr's growth - caching is really only relevant to smaller BLAST databases).
Another related consideration is covered at https://www.ncbi.nlm.nih.gov/books/NBK279668/#usermanual.Concatenation_of_queries
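A rough sketch of the single-call approach: write all snippets into one multi-FASTA file and build one blastp -remote command for it. The file names, the helper names and the two sequences are made up for illustration. `run_search` is defined but not called here, since it needs BLAST+ installed and network access.

```python
import subprocess


def to_multi_fasta(snippets):
    """Turn {name: sequence} into one multi-FASTA string."""
    return "".join(f">{name}\n{seq}\n" for name, seq in snippets.items())


def blastp_remote_cmd(query_path, out_path, db="nr"):
    """Build one blastp call that searches every sequence in query_path."""
    return ["blastp", "-remote", "-db", db, "-query", query_path, "-out", out_path]


def run_search(snippets, query_path="snippets.fasta", out_path="results.out"):
    """Write the multi-FASTA file and run a single remote search for all of it."""
    with open(query_path, "w") as handle:
        handle.write(to_multi_fasta(snippets))
    subprocess.run(blastp_remote_cmd(query_path, out_path), check=True)


# Made-up example sequences, not real cyanobacterial proteins.
snippets = {"snippet_1": "MKVLAAGIVG", "snippet_2": "MSTNPKPQRK"}
fasta_text = to_multi_fasta(snippets)
cmd = blastp_remote_cmd("snippets.fasta", "results.out")
print(fasta_text)
print(cmd)
```

With around 600 snippets it may still be worth splitting the input into a few batches instead of one very large submission. That is a judgment call, not something the BLAST documentation prescribes.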
I'm not sure where it was originally published, but BLAST (Basic Local Alignment Search Tool), Chapter 12, "Hardware and Software Optimizations", has some good tips along these lines.
On a side note, I question your choice to return XML results. I recommend you look into -outfmt 6 and its variants, which return tabular data that is easily parsed and, in my experience, contains everything needed about the HSP results for most applications.
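To illustrate how little parsing the tabular format needs, here is a small sketch that reads default -outfmt 6 output (12 tab-separated columns) and keeps the highest-scoring subject per query. The sample lines and accession-like IDs are invented, and `best_hits` is just one possible way to pick a hit.

```python
import csv
import io

# Default column order of "-outfmt 6" in BLAST+.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def parse_outfmt6(handle):
    """Yield one dict per HSP line from default -outfmt 6 output."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != len(COLUMNS):
            continue  # skip blank or malformed lines
        hit = dict(zip(COLUMNS, row))
        for key in ("pident", "evalue", "bitscore"):
            hit[key] = float(hit[key])
        yield hit


def best_hits(hits):
    """Map each query ID to the subject ID with the highest bit score."""
    best = {}
    for hit in hits:
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return {query: hit["sseqid"] for query, hit in best.items()}


# Made-up example lines, not real BLAST results.
sample = (
    "snippet_1\tACC00001.1\t98.5\t60\t1\t0\t1\t60\t10\t69\t1e-30\t120\n"
    "snippet_1\tACC00002.1\t80.0\t60\t12\t0\t1\t60\t5\t64\t1e-20\t95.1\n"
    "snippet_2\tACC00003.1\t100.0\t40\t0\t0\t1\t40\t1\t40\t2e-22\t88.2\n"
)
accessions = best_hits(parse_outfmt6(io.StringIO(sample)))
print(accessions)
```

For real output, pass an open file handle instead of the `io.StringIO` wrapper. If a custom column list is given to -outfmt 6, `COLUMNS` has to be changed to match.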
Finally, on another related note, now knowing your application, I might question your choice of database to search. You might still conduct a -remote search but learn about limiting a search by taxonomy (note: I don't know whether the Python wrapper exposes this functionality - just use command-line BLAST if not).
Blast2GO may solve this issue.
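On the suggestion of limiting a remote search by taxonomy: BLAST+ accepts an -entrez_query option together with -remote, so a restricted search could be sketched as below. The exact Entrez term ("Cyanobacteria[Organism]"), the file names and the helper name are assumptions for illustration. As far as I know, Biopython's NCBIWWW.qblast also takes an entrez_query argument.

```python
import shlex


def blastp_taxon_cmd(query_path, out_path, organism, db="nr"):
    """Build a remote blastp call restricted to one organism group."""
    entrez_query = f"{organism}[Organism]"  # assumed Entrez term format
    return ["blastp", "-remote", "-db", db, "-query", query_path,
            "-entrez_query", entrez_query, "-outfmt", "6", "-out", out_path]


cmd = blastp_taxon_cmd("snippets.fasta", "results.tsv", "Cyanobacteria")
# shlex.join quotes the Entrez term so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Passing the list form of `cmd` to `subprocess.run` avoids shell quoting problems with the square brackets in the Entrez term.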
As Lieven pointed out - there is nothing else you can really do.
By registering for an account I think you can increase the polling rate, but NCBI will set limits on the number of queries that you can send per unit time so that their network is not being hammered. No API/remote implementation will overcome the limits NCBI sets.