Question

Avoid download limit from NCBI entrez with Biopython using API key

5.8 years ago

Shred ★ 1.5k

Hello guys, I'm trying to download several genomes from NCBI using the Entrez module in Biopython. I am of course using an API key, but the download seems to stop after 20 records. Is it just me? I don't see an explicit limit in the Entrez documentation. I'm expecting around 773 records, as obtained with this search term in NCBI Assembly.

Escherichia[organism] AND complete+genome

If the answer to the first question is yes, how can I implement a clean workaround for the timeout?

As requested, here's the query I used:

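from Bio import Entrez

# Identify the client to NCBI
Entrez.email = "my@email"
Entrez.api_key = "mykey"
search_term = "Escherichia[organism] AND complete+genome[title]"
# Search the nucleotide database and keep the returned IDs
handle = Entrez.esearch(db="nucleotide", term=search_term)
genome_ids = Entrez.read(handle)['IdList']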

I currently see an IdList of just 20 records.

biopython ncbi entrez • 5.3k views

ADD COMMENT • link 5.8 years ago by Shred ★ 1.5k

You can download them one at a time and make sure not to download a file twice by checking each download.
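A minimal sketch of that idea, assuming one output file per ID. The names `download_missing` and `entrez_fetch`, the `genomes` directory, and the `.gbk` naming are all made up for illustration; `entrez_fetch` additionally assumes Biopython is installed, that `Entrez.email` and `Entrez.api_key` are already set, and that the handle yields text.

```python
from pathlib import Path


def download_missing(ids, fetch, out_dir="genomes"):
    """Fetch each ID once; skip IDs whose output file already exists and is non-empty."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    fetched = []
    for uid in ids:
        target = out / f"{uid}.gbk"
        if target.exists() and target.stat().st_size > 0:
            continue  # already downloaded in an earlier run
        text = fetch(uid)
        # Write to a temporary name first so an interrupted download
        # never leaves a partial file that looks complete.
        tmp = target.with_name(target.name + ".part")
        tmp.write_text(text)
        tmp.replace(target)
        fetched.append(uid)
    return fetched


def entrez_fetch(uid):
    """Hypothetical fetcher: one GenBank record as text via Bio.Entrez."""
    from Bio import Entrez  # imported here so the helper above has no dependency

    handle = Entrez.efetch(db="nucleotide", id=uid, rettype="gb", retmode="text")
    try:
        return handle.read()
    finally:
        handle.close()
```

Calling `download_missing(genome_ids, entrez_fetch)` a second time after a crash would then only request the records that are still missing.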

ADD REPLY • link 5.8 years ago by Asaf 10k

Even with API keys there are limits per domain on how many connections can be made over a period of time. If someone else from your institution is connecting to NCBI this way their connections also count towards the total.

Edit: If you are using a proxy server to connect to the internet, the total number of connections counted towards that IP may be higher than you think. If you are sharing an API key with someone else, all connections are counted against the key.

Build in some kind of delay after download of each record so you don't hit the connection limits.

ADD REPLY • link 5.8 years ago by GenoMax 147k

Could you please post your entire query? The API key applies only to records fetched using eUtilities. If your ultimate goal is to download genome sequence data, you should use FTP instead. If you can provide more information about what you are trying to download and the query you are using, I will be able to help you.

ADD REPLY • link 5.8 years ago by vkkodali_ncbi ★ 3.8k

Added. Using a timeout with the sleep function could be a nice workaround, but I don't know how it could be implemented in this case. I don't want to use FTP here because Entrez gives me the gbk file directly, without having to extract it from gzip archives.

ADD REPLY • link 5.8 years ago by Shred ★ 1.5k

score 1 · Accepted Answer · 2019-02-04

5.8 years ago

Carambakaracho ★ 3.3k

The limits are 10 requests per second per API key, or 3 requests per second per IP address without a key. There's no institutional limit afaik. My scripts wait 1 second after each request; this is usually negligible compared to the FTP response time, even with bacteria downloads.

From NCBI Announce

Who needs to get a key?

For most casual use, you won’t need an API key at all – you only need to get one if you expect to access E-utilities at a rate of more than three requests per second from a single computer (IP address)

[...]

Now that I have a key, are there still access limits?

Yes. By default, your key will increase the limit to 10 requests/second for all activity from that key.
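For the "wait after each request" part, here is a rough sketch of a small throttle in Python. The class name `Throttle` and its parameters are made up for illustration and are not part of Biopython. As far as I know, recent Biopython releases already space out Entrez calls to stay within the 3 or 10 requests per second, so an explicit delay like this mostly helps when you want to stay well below the limit.

```python
import time


class Throttle:
    """Block so that successive wait() calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without real waiting
        self._sleep = sleep
        self._last = None      # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage sketch: one request per second, as in the scripts mentioned above.
# throttle = Throttle(1.0)
# for uid in genome_ids:
#     throttle.wait()
#     ...  # one Entrez.efetch call per ID goes here
```

Because the throttle only sleeps for the time still missing from the interval, slow responses do not add an extra second on top of the time already spent waiting for the server.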

ADD COMMENT • link 5.8 years ago by Carambakaracho ★ 3.3k

Okay, but how could I put some delay between requests? I've added the code I used above.

ADD REPLY • link 5.8 years ago by Shred ★ 1.5k

Maybe the timeout is not even the root cause. I wanted to strongly suggest working with WebEnv, and found this help page from the NCBI, which might indicate that your problem is the default retmax value. I used to work with retmax and incrementing retstart, but I can't retrieve the scripts (these were Perl, anyway). Check whether setting retmax to 1000 solves your problem.

retmax

Total number of UIDs from the retrieved set to be shown in the XML output (default=20). By default, ESearch only includes the first 20 UIDs retrieved in the XML output. If usehistory is set to 'y', the remainder of the retrieved set will be stored on the History server; otherwise these UIDs are lost. Increasing retmax allows more of the retrieved UIDs to be included in the XML output, up to a maximum of 100,000 records. To retrieve more than 100,000 UIDs, submit multiple esearch requests while incrementing the value of retstart (see Application 3).

Another alternative can be found in this old biostars thread
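Since I can't dig out the Perl scripts, here is a rough Python sketch of the retmax/retstart approach described in the NCBI help text. The function names `search_all_ids` and `entrez_search` and the page size of 1000 are my own choices for illustration; `entrez_search` assumes Biopython is installed and that `Entrez.email` and `Entrez.api_key` are already set.

```python
def search_all_ids(search, term, db="nucleotide", page_size=1000):
    """Collect every UID for `term` by paging through the results with retstart/retmax."""
    ids = []
    retstart = 0
    while True:
        result = search(db=db, term=term, retstart=retstart, retmax=page_size)
        page = list(result["IdList"])
        ids.extend(page)
        retstart += len(page)
        # Stop on an empty page or once the reported total has been reached.
        if not page or retstart >= int(result["Count"]):
            return ids


def entrez_search(**params):
    """Hypothetical wrapper: run one ESearch request and return the parsed result."""
    from Bio import Entrez  # imported here so the paging logic has no dependency

    handle = Entrez.esearch(**params)
    try:
        return Entrez.read(handle)
    finally:
        handle.close()


# Usage sketch:
# genome_ids = search_all_ids(entrez_search, "Escherichia[organism] AND complete+genome[title]")
```

For a result set of a few hundred records a single request with retmax=1000 is enough; the loop only matters once the count exceeds the page size.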

ADD REPLY • link 5.8 years ago by Carambakaracho ★ 3.3k

Yeah, solved by using retmax and catching the timeout exception to handle timeouts from the server.
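For anyone landing here later, a minimal sketch of the timeout handling, not my exact script. The name `fetch_with_retry`, the number of attempts, and the delay are illustrative assumptions. It catches `OSError`, which in Python 3 covers `urllib.error.URLError`, `urllib.error.HTTPError`, and socket timeouts; other failure types would need their own handling.

```python
import time


def fetch_with_retry(fetch, uid, attempts=3, delay=5.0, sleep=time.sleep):
    """Call fetch(uid); on a network error wait and retry, re-raising after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(uid)
        except OSError:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            sleep(delay * attempt)  # wait a bit longer after each failure


# Usage sketch, where fetch_one is any function that downloads one record by ID:
# record_text = fetch_with_retry(fetch_one, "some_id")
```

The growing delay gives the server some room to recover instead of hammering it with immediate retries.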

ADD REPLY • link 5.8 years ago by Shred ★ 1.5k

Hi Shred, great. BTW, beyond narcissism, you'll help people in the future figure out whether an answer was helpful if you upvote/accept it. Especially on such niche topics as the e-utils ;-)

ADD REPLY • link 5.8 years ago by Carambakaracho ★ 3.3k

Shred : If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted.

Please do the same for your previous posts as well.

ADD REPLY • link 5.8 years ago by GenoMax 147k