I have to download a million protein seq from NCBI. I worked on a few line of code using also suggestions from here
When I test my code I get as a result an empty file:
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> from Bio import Entrez
>>> Entrez.email ="xxx@example.com"
>>> from urllib2 import HTTPError
>>> import time
>>> data = np.loadtxt('/home/xxx.txt', dtype="string")
>>> x=data[:1000]
>>> x=",".join(x)
>>> x_1=data[1001:2001]
>>> x_1=",".join(x_1)
>>> x_2=data[20002:3001]
>>> x_2=",".join(x_2)
>>> prot=(x, x_1, x_2)
>>> for c in iter(prot):
... try:
... handle = Entrez.efetch(db="protein", id=c, rettype="fasta", retmode="txt")
... except HTTPError:
... time.sleep(20)
... handle = Entrez.efetch(db="protein", id=c, rettype="fasta", retmode="txt")
... time.sleep(1) # to make sure not many requests go per second to ncbi
...
>>> handle.read()
'Supplied id parameter is empty.\n'
Where am I wrong?
Good point on the scale of this problem - I missed that, and agree an FTP download or similar might be best.
Note Biopython will pause automatically to rate limit the queries, so putting the sleep before a retry is sensible.
And yes, you are right that the example does not attempt to parse the data returned, but this is actually a good thing for clearly showing the error message from the NCBI in this case. Sadly the NCBI API fails to use an HTTP error status when things break :(
At least he got an error, I've had problems where the only way I knew my script had failed was because the NCBI website stopped working (was overloading the query rate). Biopython says that it limits the rate of queries, but I've always resorted to limiting it to one per second.
I wish NCBI would provide more explicit error messages, or that BioPython would handle them (if they already exist).
I wish the NCBI would give real error messages too. If they used HTTP error codes like TogoWS does, it would be so much easier (see also the TogoWS interface in Biopython).
Thanks very much for your comment! Actually, your last point was really a great idea! So if I understand correctly you suggest to download the NCBI bacteria.faa DB and then select in it only the seq with the ID I need. Is that correct?Any suggestion how could I do that?
Yep, that would do the trick. There's a ton of ways to go about this, you could use BioPython, awk, etc. You could even download the nr BLAST database and use blastdbcmd to pull the sequences you want. Just be sure that you consider the amount of memory you have, you may want to split the whole bacteria.faa file into multiple fasta files.
With Biopython, I would use
Bio.SeqIO.index(...)
orBio.SeqIO.index_db(...)
as described in the Biopython tutorial