Biopython HTTPError when Fetching more than 1400 Entries from NCBI
4.9 years ago

Hello,

The premise: I have a lot of GenBank IDs (about 4 million) for which I need to determine the "root" organism. The way I have tried this so far is to use the EUtils via Biopython: first uploading my IDs to the NCBI servers with EPost, then retrieving the smallest possible XML file for each ID with EFetch and parsing it.

The problem: I can only fetch about 1400 IDs (XMLs) from the server. If I try to fetch more, the server does not respond. Is there a way to fix this? Is the history server limited to 1400 IDs per session?

(id_list is just a long list of IDs for testing, about 37k; count is the length of that list.)

My Code:

from Bio import Entrez

def biopython_epost(id_list):
    Entrez.email = "myMail@tum.de"
    # upload the IDs to the history server and keep the session handles
    e_post = Entrez.epost(db="nuccore", id=",".join(id_list))
    search_results = Entrez.read(e_post)
    webenv = search_results["WebEnv"]
    query_key = search_results["QueryKey"]
    handle_ids = {"WebEnv": webenv, "QueryKey": query_key}
    return handle_ids

def biopython_efetch(handle_ids, count):
    webEnv = handle_ids["WebEnv"]
    queryKey = handle_ids["QueryKey"]
    Entrez.email = "mymail@tum.de"
    Entrez.api_key = "myAPIKEY"
    batch_size = 200
    yeast_hits = {}
    for start in range(0, count, batch_size):
        print("Going to download record %i to %i" % (start + 1, count))
        fetch_handle = Entrez.efetch(db="nucleotide",
                                     rettype="docsum",
                                     retmode="xml",
                                     retmax=batch_size,
                                     retstart=start,
                                     query_key=queryKey,
                                     webenv=webEnv)
        fetch_records = Entrez.parse(fetch_handle)
        for record in fetch_records:
            # count hits by the first four words of each record title
            temp = record['Title'].split(' ')[0:4]
            yeast_info = ' '.join(temp)
            yeast_hits[yeast_info] = yeast_hits.get(yeast_info, 0) + 1
        fetch_handle.close()
    return yeast_hits

Result:

Going to download record 1 to 37755
Going to download record 201 to 37755
Going to download record 401 to 37755
Going to download record 601 to 37755
Going to download record 801 to 37755
Going to download record 1001 to 37755
Going to download record 1201 to 37755
Going to download record 1401 to 37755
HTTPError: HTTP Error 400: Bad Request
Biopython EUtils NCBI

Have you signed up for (and are you using) an NCBI API key? When doing a large search like this, you should build in a delay so NCBI does not think you are spamming their server.
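For what it's worth, a minimal sketch of both suggestions (the email and key are placeholders; api_key, max_tries and sleep_between_tries are settings the Bio.Entrez module actually exposes):

import time
from Bio import Entrez

Entrez.email = "you@example.org"       # placeholder
Entrez.api_key = "YOUR_NCBI_API_KEY"   # placeholder; a key raises the limit from 3 to 10 requests/second
Entrez.max_tries = 5                   # have Biopython retry failed requests on its own
Entrez.sleep_between_tries = 15        # seconds to wait between those retries

# ... and inside the batch loop, pause between requests:
time.sleep(0.34)                       # stays under 3 requests/second (0.11 is enough with a key)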


Yes, I just deleted the API key from my post, that's all; I did sign up and I am sending the key with the request.


In my experience, you don't need EPost. Just join e.g. 100 IDs and pass them to efetch directly; I wouldn't pass many more IDs at a time.

For example, I'm using this construction to fetch FASTA sequences without problems:

with Entrez.efetch(db='nucleotide', id=','.join(aclist), rettype='fasta', retmode='text') as h:
     ....
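Expanded to the batched setting of your question, it could look something like this (fetch_in_chunks is a made-up helper name; chunk size of 100 per the advice above, aclist being your full ID list):

from Bio import Entrez

Entrez.email = "you@example.org"  # placeholder

def fetch_in_chunks(aclist, chunk_size=100):
    # pass each slice of IDs straight to efetch -- no EPost/history server involved
    for i in range(0, len(aclist), chunk_size):
        chunk = aclist[i:i + chunk_size]
        with Entrez.efetch(db="nucleotide", id=",".join(chunk),
                           rettype="docsum", retmode="xml") as handle:
            for record in Entrez.parse(handle):
                yield record

As far as I know, Biopython switches to an HTTP POST by itself when the joined ID string makes the URL too long, so 100 IDs per call should be fine.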

To my knowledge, there are some limits for the NCBI databases; check out: https://eutils.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html

There may also be transient errors (you are downloading data over the internet), so you need to handle them.
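For example, with a simple retry wrapper (the name and back-off values are just illustrative):

import time
from urllib.error import HTTPError

def fetch_with_retry(do_fetch, max_attempts=5, delay=10):
    # do_fetch is a zero-argument callable that performs one Entrez request
    for attempt in range(1, max_attempts + 1):
        try:
            return do_fetch()
        except HTTPError as err:
            if attempt == max_attempts:
                raise
            print("Attempt %i failed (HTTP %i), retrying in %i s" % (attempt, err.code, delay))
            time.sleep(delay)

# usage, e.g.:
# handle = fetch_with_retry(lambda: Entrez.efetch(db="nucleotide", id=ids,
#                                                 rettype="docsum", retmode="xml"))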


Thank you for that link! Still strange that I can only submit about 1400, as it seems 5k should be the minimum. I tried catching HTTP errors, but to no avail; it keeps failing at 1400. Error 400 is server-side anyway, so all I can do is wait and retry. It's just that everywhere I look in the documentation, it says you should absolutely use EPost for large datasets and many requests. EFetch can fetch 200 IDs at once, but to get about 4 million IDs in total... well, I've got the weekend ahead of me, I hope they won't ban me.


Hi, did you find a solution to this problem? I'm having the same issue and can't find any clear answer. Thank you so much!
