Hi all,
I'm putting together a script that returns metadata for entries in the nucleotide database, using E-utilities via Biopython with the following function:
import json

from Bio import Entrez

Entrez.email = "you@example.com"  # NCBI requires a contact email for E-utilities

def fetch_nucleotide_info(query_list):
    # Convert query list to a comma-separated string
    query_str = ','.join(query_list)
    # print(query_list)  # debug
    # Submit the list to epost and keep the history-server session details
    post_handle = Entrez.epost(db="nucleotide", id=query_str)
    post_results = Entrez.read(post_handle)
    post_handle.close()
    webenv = post_results["WebEnv"]
    query_key = post_results["QueryKey"]
    # Get summaries from esummary (version 2.0, JSON output)
    summary_handle = Entrez.esummary(db="nucleotide", webenv=webenv,
                                     query_key=query_key, version="2.0",
                                     retmode="json")
    summary_data = json.load(summary_handle)
    summary_handle.close()
    # Collect the fields of interest for each returned UID
    records = []
    for uid in summary_data['result']['uids']:
        entry = summary_data['result'][uid]
        records.append({
            'accession': entry.get('caption', ''),  # was notGI
            'description': entry.get('title', ''),
            'created': entry.get('createdate', ''),
            'updated': entry.get('updatedate', ''),
            'subtype': entry.get('subtype', ''),
            'subname': entry.get('subname', ''),
        })
    return records
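For reference, a minimal call looks like this (the accessions are just placeholders; the real lists are much longer):

    if __name__ == "__main__":
        accessions = ["NM_000546", "NM_007294"]  # placeholder IDs
        for record in fetch_nucleotide_info(accessions):
            print(record['accession'], record['description'])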
The problem I'm having is that the NCBI server intermittently returns an error:
File "/Users/runner/miniforge3/conda-bld/python-split_1703348537777/work/Modules/pyexpat.c", line 461, in EndElement
File "/Users/me/opt/miniconda3/lib/python3.9/site-packages/Bio/Entrez/Parser.py", line 790, in endErrorElementHandler
raise RuntimeError(data)
RuntimeError: Some IDs have invalid value and were omitted. Maximum ID value 18446744073709551615
This happens especially when the lookup list is larger than around 100 entries, yet the exact same command often succeeds without error a few seconds later, so it seems to be an issue with NCBI request-server load. Ideally this will be a scheduled process, so it needs to run without errors and without much interaction. The best workaround I can think of is to randomly subsample large lists down to fewer than 100 entries, but that's not ideal. Is there a more robust way of using e-utilities to query larger lists?
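Something like this chunk-and-retry wrapper would at least avoid dropping entries, though it feels clunky (a rough sketch only; the batch size, retry count and delays are arbitrary, and fetch_nucleotide_info is the function above):

    import time

    def fetch_in_batches(query_list, batch_size=100, max_retries=3):
        # Split the full list into fixed-size chunks so nothing is subsampled away
        records = []
        for start in range(0, len(query_list), batch_size):
            batch = query_list[start:start + batch_size]
            for attempt in range(max_retries):
                try:
                    records.extend(fetch_nucleotide_info(batch))
                    break
                except RuntimeError:
                    # Transient NCBI error: back off and retry the same chunk
                    time.sleep(2 ** attempt)
            else:
                raise RuntimeError(f"batch starting at index {start} failed "
                                   f"after {max_retries} attempts")
            time.sleep(0.4)  # stay under NCBI's 3-requests-per-second limit
        return records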
So I tried increasing retmax to 10,000 and the error persisted. BUT it turns out that retmax for JSON output (which is what the code requests) is limited to 500, which largely explains this. Setting retmax to 500 stops the error but also truncates the output (as expected).
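For anyone hitting the same cap: the full result set can still be pulled from the same history session by paging esummary with retstart. A sketch, reusing webenv/query_key from the epost step above, and assuming each page's 'uids' list only covers that page:

    # Page through the JSON esummary output 500 records at a time
    page_size = 500
    retstart = 0
    results = {}
    while True:
        handle = Entrez.esummary(db="nucleotide", webenv=webenv,
                                 query_key=query_key, version="2.0",
                                 retmode="json", retstart=retstart,
                                 retmax=page_size)
        page = json.load(handle)
        handle.close()
        uids = page['result']['uids']
        for uid in uids:
            results[uid] = page['result'][uid]
        if len(uids) < page_size:
            break  # final (possibly partial) page
        retstart += page_size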
The version 2.0 XML output has a few known issues, which means the JSON is slightly more straightforward to parse with existing modules. So it can be completely solved, but it needs a bit more work.
But thanks anyway, Pierre.