I ran into this today and wanted to toss it out there. It appears that, using the Biopython interface to Entrez at NCBI, it's not possible to get results back (at least from elink) in the same order as the input. Please see the code below for an example. I have thousands of GIs for which I need to get taxonomy information, and querying them individually is painfully slow due to NCBI restrictions.
from Bio import Entrez
Entrez.email = "my@email.com"
ids = ["148908191", "297793721", "48525513", "507118461"]
search_results = Entrez.read(Entrez.epost("protein", id=','.join(ids)))
webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]
print Entrez.read(Entrez.elink(webenv=webenv,
                               query_key=query_key,
                               dbfrom="protein",
                               db="taxonomy"))
print "-------"
for i in ids:
    search_results = Entrez.read(Entrez.epost("protein", id=i))
    webenv = search_results["WebEnv"]
    query_key = search_results["QueryKey"]
    print Entrez.read(Entrez.elink(webenv=webenv,
                                   query_key=query_key,
                                   dbfrom="protein",
                                   db="taxonomy"))
Results:
[{u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '211604'}, {u'Id': '81972'}, {u'Id': '32630'}, {u'Id': '3332'}], u'LinkName': 'protein_taxonomy'}], u'DbFrom': 'protein', u'IdList': ['148908191', '297793721', '48525513', '507118461'], u'LinkSetDbHistory': [], u'ERROR': []}]
-------
[{u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '3332'}], u'LinkName': 'protein_taxonomy'}], u'DbFrom': 'protein', u'IdList': ['148908191'], u'LinkSetDbHistory': [], u'ERROR': []}]
[{u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '81972'}], u'LinkName': 'protein_taxonomy'}], u'DbFrom': 'protein', u'IdList': ['297793721'], u'LinkSetDbHistory': [], u'ERROR': []}]
[{u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '211604'}], u'LinkName': 'protein_taxonomy'}], u'DbFrom': 'protein', u'IdList': ['48525513'], u'LinkSetDbHistory': [], u'ERROR': []}]
[{u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '32630'}], u'LinkName': 'protein_taxonomy'}], u'DbFrom': 'protein', u'IdList': ['507118461'], u'LinkSetDbHistory': [], u'ERROR': []}]
The elink documentation at NCBI says this should be possible, but only by passing multiple 'id=' parameters, which doesn't appear to be possible with the Biopython epost interface. Has anyone else seen this, or am I missing something obvious?
Thanks!
Note: this is a cross-post from StackOverflow at https://stackoverflow.com/questions/25775309/entrez-epost-elink-returns-results-out-of-order-with-biopython
There are no identical URL parameter names - the ID parameter is held as a single string (comma separated), so where do you think the dictionary step (and loss of order) happens?
What the OP says the EUtils documentation seems to recommend is that one can force a certain ordering by passing identically named parameters, that is, by repeating id= once per ID.
In epost that does not seem to be possible, because the parameters are expected to be a dictionary.
But it would be possible (the default Python urlencode would support it) if the parameters were given as a list of tuples.
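To make the difference concrete, here is a minimal Python 3 sketch using only the standard library's urlencode. It does not call Biopython or NCBI; the variable names are just for illustration, and the IDs are the ones from the question.

```python
from urllib.parse import urlencode

ids = ["148908191", "297793721", "48525513", "507118461"]

# A dict holds one value per key, so the IDs have to be joined into one string.
joined_query = urlencode({"db": "protein", "id": ",".join(ids)})
print(joined_query)
# db=protein&id=148908191%2C297793721%2C48525513%2C507118461

# A list of (name, value) tuples may repeat a name; urlencode keeps the order.
repeated_query = urlencode([("db", "protein")] + [("id", i) for i in ids])
print(repeated_query)
# db=protein&id=148908191&id=297793721&id=48525513&id=507118461

# Alternatively, a dict with a list value and doseq=True also repeats the name.
doseq_query = urlencode({"db": "protein", "id": ids}, doseq=True)
print(doseq_query == repeated_query)
# True
```

Note that urlencode percent-encodes the comma as %2C in the joined form.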
I have not actually checked whether the statement about ordering is valid. I just looked at how this works because I was curious whether identically named parameters could be passed at all, since that seems to be a useful corner case.
Where does the NCBI say the URL should repeat the id parameter like that? See http://www.ncbi.nlm.nih.gov/books/NBK25499/ which says for the elink id argument "UID list. Either a single UID or a comma-delimited list of UIDs may be provided. ..." and for the epost id argument "UID list. Either a single UID or a comma-delimited list of UIDs may be provided."
To me this says we should build the URL using
.../epost.fcgi?id=id1,id2&db=...
instead of
.../epost.fcgi?id=id1&id=id2&db=...
(which older Biopython code used to do, but the NCBI started giving an Error 500 here, so we changed to the comma-separated list as of https://github.com/biopython/biopython/commit/f18361653531b48282cb73d221550d42612fbba9).
From that page, under the ELink section:
Find one-to-one links from protein to gene.
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=protein&db=gene&id=15718680&id=157427902&id=119703751
I find that documentation misleading. To me, the introduction says don't do http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=protein&db=gene&id=15718680&id=157427902&id=119703751 (which gives the one-to-one results) but instead do http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=protein&db=gene&id=15718680,157427902,119703751 which gives the muddled results. This is confusing :(
Thanks, Istvan. Yeah, I'll probably just build my own query, but I wanted to make this visible so it could be explored.
Given the two different modes of elink with multiple links, would you prefer Biopython always built its URL with the repeated &id=... bits in order to get the one-to-one mapping?
Or something like: if you give Biopython a comma-separated string it uses that as is (a single &id=... in the URL, as now), but if you give a list of IDs it uses multiple &id=... in the URL to get one-to-one mappings?
Thanks, Peter. Probably not always, but it would be nice to have the option ;-)
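The second option Peter describes might look roughly like the Python 3 sketch below. This is not Biopython's code: id_params and elink_query are hypothetical helper names, and the only assumption is the behaviour described above (a string is passed through as is, a list is expanded into repeated id parameters).

```python
from urllib.parse import urlencode


def id_params(ids):
    """Hypothetical helper: turn an `id` argument into (name, value) pairs."""
    if isinstance(ids, str):
        # A comma-separated string is used as is: one id=... in the URL.
        return [("id", ids)]
    # Any other iterable of IDs becomes one id=... per element, in order.
    return [("id", str(i)) for i in ids]


def elink_query(ids, dbfrom="protein", db="taxonomy"):
    """Hypothetical helper: build the query string for an elink request."""
    return urlencode([("dbfrom", dbfrom), ("db", db)] + id_params(ids))


print(elink_query("15718680,157427902"))
# dbfrom=protein&db=taxonomy&id=15718680%2C157427902
print(elink_query(["15718680", "157427902"]))
# dbfrom=protein&db=taxonomy&id=15718680&id=157427902
```

This keeps the current behaviour for callers who already pass a joined string, and makes the one-to-one mode opt-in by passing a list.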
FWIW, I was able to get around my 1:1 problem by building my own elink URLs (with requests), batching them, returning XML, and then parsing it with Entrez.read().
Issue filed with Biopython elink URL construction, https://github.com/biopython/biopython/issues/361
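For anyone wanting to try the same workaround, here is a rough Python 3 sketch of it, not the OP's actual code. The function names, the batch size of 200, and the timeout are assumptions. It also assumes that with repeated id parameters elink returns one LinkSet per input ID, shaped like the single-ID results shown in the question. Only fetch_links touches the network, and it needs requests and Biopython installed.

```python
from urllib.parse import urlencode

ELINK_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def batches(ids, size):
    """Yield consecutive slices of `ids`, each with at most `size` items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def elink_url(ids, dbfrom="protein", db="taxonomy"):
    """Build an elink URL that repeats id= once per ID (one-to-one mode)."""
    params = [("dbfrom", dbfrom), ("db", db)] + [("id", i) for i in ids]
    return ELINK_URL + "?" + urlencode(params)


def pair_links(linksets):
    """Map each source ID to its linked IDs, given parsed elink output."""
    mapping = {}
    for linkset in linksets:
        source = linkset["IdList"][0]  # assumes one source ID per LinkSet
        mapping[source] = [link["Id"]
                           for linkdb in linkset["LinkSetDb"]
                           for link in linkdb["Link"]]
    return mapping


def fetch_links(ids, batch_size=200):
    """Query elink in batches; returns {source ID: [linked IDs]}."""
    import io
    import requests          # third-party, imported lazily
    from Bio import Entrez   # Biopython, used only for XML parsing

    result = {}
    for chunk in batches(ids, batch_size):
        response = requests.get(elink_url(chunk), timeout=30)
        response.raise_for_status()
        parsed = Entrez.read(io.BytesIO(response.content))
        result.update(pair_links(parsed))
    return result
```

Because the result is keyed by the source ID taken from each LinkSet, the mapping does not depend on the order in which NCBI returns the records.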