Question

Downloading protein seq from NCBI with biopython results in empty file

0

Entering edit mode

9.9 years ago

dago ★ 2.8k

I have to download a million protein seq from NCBI. I worked on a few line of code using also suggestions from here

When I test my code I get as a result an empty file:

Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> from Bio import Entrez
>>> Entrez.email ="xxx@example.com"
>>> from urllib2 import HTTPError
>>> import time
>>> data = np.loadtxt('/home/xxx.txt', dtype="string")
>>> x=data[:1000]
>>> x=",".join(x)
>>> x_1=data[1001:2001]
>>> x_1=",".join(x_1)
>>> x_2=data[20002:3001]
>>> x_2=",".join(x_2)
>>> prot=(x, x_1, x_2)
>>> for c in iter(prot):
...     try:
...             handle = Entrez.efetch(db="protein", id=c, rettype="fasta", retmode="txt")
...     except HTTPError:
...             time.sleep(20)
...             handle = Entrez.efetch(db="protein", id=c, rettype="fasta", retmode="txt")
...             time.sleep(1) # to make sure not many requests go per second to ncbi
...
>>> handle.read()
'Supplied id parameter is empty.\n'

Where am I wrong?

biopython NCBI sequence • 5.7k views

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by dago ★ 2.8k

Ram · Answer 1 · 2015-01-09

1

Entering edit mode

9.9 years ago

pld 5.1k

For x_2=data[20002:3001] did you mean 2002 instead of 20002?

Also the sleep(1) line should be one level out, you should pause for at least a second for each query, not just if the query fails.

Lastly, you need to use Entrez.read(handle) to get BioPython to parse the results into python objects, the value returned by Entrez.efetch is just XML data directly from NCBI.

A million sequences is a fairly large number for trying to go through Entrez, have you considered downloading bulk data from their FTP service and filtering it.

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by pld 5.1k

0

Entering edit mode

Good point on the scale of this problem - I missed that, and agree an FTP download or similar might be best.

Note Biopython will pause automatically to rate limit the queries, so putting the sleep before a retry is sensible.

And yes, you are right that the example does not attempt to parse the data returned, but this is actually a good thing for clearly showing the error message from the NCBI in this case. Sadly the NCBI API fails to use an HTTP error status when things break :(

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Peter 6.0k

0

Entering edit mode

At least he got an error, I've had problems where the only way I knew my script had failed was because the NCBI website stopped working (was overloading the query rate). Biopython says that it limits the rate of queries, but I've always resorted to limiting it to one per second.

I wish NCBI would provide more explicit error messages, or that BioPython would handle them (if they already exist).

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by pld 5.1k

0

Entering edit mode

I wish the NCBI would give real error messages too. If they used HTTP error codes like TogoWS does, it would be so much easier (see also the TogoWS interface in Biopython).

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Peter 6.0k

0

Entering edit mode

Thanks very much for your comment! Actually, your last point was really a great idea! So if I understand correctly you suggest to download the NCBI bacteria.faa DB and then select in it only the seq with the ID I need. Is that correct?Any suggestion how could I do that?

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by dago ★ 2.8k

0

Entering edit mode

Yep, that would do the trick. There's a ton of ways to go about this, you could use BioPython, awk, etc. You could even download the nr BLAST database and use blastdbcmd to pull the sequences you want. Just be sure that you consider the amount of memory you have, you may want to split the whole bacteria.faa file into multiple fasta files.

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by pld 5.1k

0

Entering edit mode

With Biopython, I would use Bio.SeqIO.index(...) or Bio.SeqIO.index_db(...)as described in the Biopython tutorial

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Peter 6.0k

Ram · Answer 2 · 2015-01-09

0

Entering edit mode

9.9 years ago

Peter 6.0k

The NCBI is trying to tell you your ID was missing, my guess is you have a blank line in your text file of identifiers?

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Peter 6.0k