Question

Entrez epost + elink returns results out of order with Biopython

10.9 years ago

Chris F. ▴ 20

I ran into this today and wanted to toss it out there. It appears that, using the Biopython interface to Entrez at NCBI, it's not possible to get results back (at least from elink) in the same order as the input. Please see the code below for an example. I have thousands of GIs for which I need to get taxonomy information, and querying them individually is painfully slow due to NCBI restrictions.

from Bio import Entrez
Entrez.email = "my@email.com"
ids = ["148908191", "297793721", "48525513", "507118461"]
search_results = Entrez.read(Entrez.epost("protein", id=','.join(ids)))
webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]
# Batched: all IDs were posted at once, then linked via the history server
print(Entrez.read(Entrez.elink(webenv=webenv,
                               query_key=query_key,
                               dbfrom="protein",
                               db="taxonomy")))

print("-------")

# One at a time: post and link each ID separately
for i in ids:
    search_results = Entrez.read(Entrez.epost("protein", id=i))
    webenv = search_results["WebEnv"]
    query_key = search_results["QueryKey"]
    print(Entrez.read(Entrez.elink(webenv=webenv,
                                   query_key=query_key,
                                   dbfrom="protein",
                                   db="taxonomy")))

Results:

[{u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '211604'}, {u'Id': '81972'}, {u'Id': '32630'}, {u'Id': '3332'}], u'LinkName': 'protein_taxonomy'}], u'DbFrom': 'protein', u'IdList': ['148908191', '297793721', '48525513', '507118461'], u'LinkSetDbHistory': [], u'ERROR': []}]
-------
[{u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '3332'}], u'LinkName': 'protein_taxonomy'}], u'DbFrom': 'protein', u'IdList': ['148908191'], u'LinkSetDbHistory': [], u'ERROR': []}]
[{u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '81972'}], u'LinkName': 'protein_taxonomy'}], u'DbFrom': 'protein', u'IdList': ['297793721'], u'LinkSetDbHistory': [], u'ERROR': []}]
[{u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '211604'}], u'LinkName': 'protein_taxonomy'}], u'DbFrom': 'protein', u'IdList': ['48525513'], u'LinkSetDbHistory': [], u'ERROR': []}]
[{u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '32630'}], u'LinkName': 'protein_taxonomy'}], u'DbFrom': 'protein', u'IdList': ['507118461'], u'LinkSetDbHistory': [], u'ERROR': []}]

The elink documentation at NCBI says this should be possible, but only by passing multiple 'id=' parameters, which doesn't appear to be possible with the Biopython epost interface. Has anyone else seen this, or am I missing something obvious?

Thanks!

Note: this is a cross-post from StackOverflow at https://stackoverflow.com/questions/25775309/entrez-epost-elink-returns-results-out-of-order-with-biopython

python ncbi biopython • 7.8k views

ADD COMMENT • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by Chris F. ▴ 20

Ram · Answer 1 · 2014-09-10

10.9 years ago

Istvan Albert 103k

It seems that passing identical parameter names multiple times is not possible in Biopython's epost, since it passes them as a dictionary.

On the other hand, the Entrez interface is a very thin layer over the eutils URLs that it accesses. It is very easy to build your own URL that populates the parameters correctly. You could use a library like requests http://docs.python-requests.org/en/latest/ to make it super simple.

If that does not work then reordering the results is the next workaround - put your results into a dictionary keyed by the id then iterate on the original keys and pull the values from the dictionary. That should be no problem for data sizes of tens of thousands.
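The reordering workaround can be sketched roughly as follows. This is only an illustration: the function name `reorder_by_input` and the record fields "Id" and "TaxId" are made up for the example, and it assumes each returned record carries the ID it belongs to (which the grouped elink output in the question does not).

```python
def reorder_by_input(records, input_ids, key="Id"):
    """Return records in the order of input_ids (None where an ID is missing)."""
    # Index the results by their own ID (hypothetical "Id" field).
    by_id = {record[key]: record for record in records}
    # Walk the original IDs and pull each result out of the dictionary.
    return [by_id.get(uid) for uid in input_ids]


input_ids = ["148908191", "297793721", "48525513", "507118461"]

# Illustrative records arriving in a different order than requested.
shuffled = [
    {"Id": "48525513", "TaxId": "211604"},
    {"Id": "297793721", "TaxId": "81972"},
    {"Id": "507118461", "TaxId": "32630"},
    {"Id": "148908191", "TaxId": "3332"},
]

ordered = reorder_by_input(shuffled, input_ids)
print([record["TaxId"] for record in ordered])
```

Both the indexing and the lookup are linear in the number of records, so this stays cheap for tens of thousands of IDs.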

ADD COMMENT • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by Istvan Albert 103k

There are no identical URL parameter names - the ID parameter is held as a single string (comma separated), so where do you think the dictionary step (and loss of order) happens?

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by Peter 6.0k

The OP states that the EUtils documentation seems to recommend forcing a certain ordering by passing identical parameters, like so:

query?id=1&id=2&id=3

In epost that does not seem to be possible because the parameter is expected to be a dictionary

param=dict(id=1)

But it would be possible (the default Python urlencode would support it) if the parameters were in the form of a list of tuples, like so

[('id', 1), ('id',2), ('id', 3)]

I have not actually checked the statement about ordering for validity. I just looked at how this works because I was curious whether identically named parameters could be passed, since that appears to be a corner case with some utility.
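For reference, here is a small sketch of how the standard library's urlencode treats the three shapes discussed above. It uses the Python 3 location, urllib.parse.urlencode (in Python 2 the function lived in urllib), and the "db" value is just a placeholder.

```python
from urllib.parse import urlencode

# A dictionary can hold "id" only once, so the IDs travel as one
# comma-separated value (the comma is percent-encoded as %2C).
as_dict = urlencode({"db": "protein", "id": "1,2,3"})

# A list of tuples may repeat a name, giving one id= per value.
as_pairs = urlencode([("db", "protein"), ("id", 1), ("id", 2), ("id", 3)])

# doseq=True expands a list value into repeated parameters as well.
as_doseq = urlencode({"db": "protein", "id": [1, 2, 3]}, doseq=True)

print(as_dict)
print(as_pairs)
print(as_doseq)
```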

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by Istvan Albert 103k

Where does the NCBI say the URL should repeat the id parameter like that? See http://www.ncbi.nlm.nih.gov/books/NBK25499/ which says for the elink id argument "UID list. Either a single UID or a comma-delimited list of UIDs may be provided. ..." and for the epost id argument "UID list. Either a single UID or a comma-delimited list of UIDs may be provided."

To me this says we should build the URL using .../epost.fcgi?id=id1,id2&db=... instead of .../epost.fcgi?id=id1&id=id2&db=... (which older Biopython code used to do, but the NCBI started giving an Error 500 here so we changed to the comma separated list as of https://github.com/biopython/biopython/commit/f18361653531b48282cb73d221550d42612fbba9).

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by Peter 6.0k

From that page, under the ELink section:

If more than one id parameter is provided, ELink will perform a separate link operation for the set of UIDs specified by each id parameter. This effectively accomplishes "one-to-one" links and preserves the connection between the input and output UIDs.

Find one-to-one links from protein to gene.

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=protein&db=gene&id=15718680&id=157427902&id=119703751

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by Chris F. ▴ 20

I find that documentation misleading. To me, the introduction says don't do http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=protein&db=gene&id=15718680&id=157427902&id=119703751 (which gives the one-to-one results) but instead do http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=protein&db=gene&id=15718680,157427902,119703751 which gives the muddled results. This is confusing :(

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by Peter 6.0k

Thanks, Istvan. Yeah, I'll probably just build my own query, but I wanted to make this visible so it could be explored.

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by Chris F. ▴ 20

Given the two different modes of elink with multiple links, would you prefer Biopython always built its URL with the repeated &id=... bits in order to get the one-to-one mapping?

Or something like if you give Biopython a comma separated string it uses that as is (single &id=... in the URL as now) but if you give a list of IDs it uses multiple &id=... in the URL to get one-to-one mappings?

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by Peter 6.0k

Thanks, Peter. Probably not always, but it would be nice to have the option ;-)

FWIW, I was able to get around my 1:1 problem by building my own elink URLs (with requests) and batching them, returning XML, and then parsing it with Entrez.read().
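A rough sketch of that kind of workaround is below. It is not the code actually used in this thread: the helper names (`batches`, `build_elink_url`, `fetch_links`) and the batch size are invented for illustration, the standard library's urlopen stands in for requests so the snippet has fewer dependencies, and it uses the https form of the eutils address. Only the URL-building part runs without network access; `fetch_links` additionally needs Biopython installed.

```python
from urllib.parse import urlencode

ELINK_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def batches(ids, size):
    """Yield consecutive slices of ids, preserving the input order."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def build_elink_url(ids, dbfrom="protein", db="taxonomy"):
    """Return an ELink URL with one id= parameter per UID (one-to-one mode)."""
    params = [("dbfrom", dbfrom), ("db", db)] + [("id", uid) for uid in ids]
    return ELINK_URL + "?" + urlencode(params)


def fetch_links(ids, email):
    """Fetch and parse one batch; needs network access and Biopython."""
    from urllib.request import urlopen
    from Bio import Entrez

    url = build_elink_url(ids) + "&" + urlencode({"email": email})
    with urlopen(url) as handle:
        # Entrez.read() parses the XML reply into a list of LinkSet records.
        return Entrez.read(handle)


ids = ["148908191", "297793721", "48525513", "507118461"]
urls = [build_elink_url(batch) for batch in batches(ids, 2)]
for url in urls:
    print(url)
```

A real run would loop over `batches(ids, n)`, call `fetch_links` on each batch, and pause between requests to stay within NCBI's usage limits.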

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by Chris F. ▴ 20

Issue filed with Biopython elink URL construction, https://github.com/biopython/biopython/issues/361

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by Peter 6.0k

Ram · Answer 2 · 2014-09-11

Because Python functions can only take a named argument once, you cannot do epost(..., id=id1, id=id2, ...) so instead we expect you to either use a list epost(..., id=my_id_list, ...) or as in your example a comma separated string epost(..., id=",".join(my_id_list), ...) which is what the code does internally if you use a list, see https://github.com/biopython/biopython/commit/f18361653531b48282cb73d221550d42612fbba9

As to the result order, that seems to be down to the NCBI. Print the raw XML and you get this:

<?xml version="1.0"?>
<!DOCTYPE eLinkResult PUBLIC "-//NLM//DTD eLinkResult, 23 November 2010//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eLink_101123.dtd">
<eLinkResult>
<LinkSet>
    <DbFrom>protein</DbFrom>
    <IdList>
        <Id>148908191</Id>
        <Id>297793721</Id>
        <Id>48525513</Id>
        <Id>507118461</Id>
    </IdList>
    <LinkSetDb>
        <DbTo>taxonomy</DbTo>
        <LinkName>protein_taxonomy</LinkName>
        <Link>
            <Id>211604</Id>
        </Link>
        <Link>
            <Id>81972</Id>
        </Link>
        <Link>
            <Id>32630</Id>
        </Link>
        <Link>
            <Id>3332</Id>
        </Link>
    </LinkSetDb>
</LinkSet>
</eLinkResult>

You are hoping for 148908191 --> 3332, 297793721 --> 81972, 48525513 --> 211604 and 507118461 --> 32630 here?
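To make the problem concrete, here is a minimal sketch that parses a trimmed copy of the XML above. It uses the standard library's xml.etree.ElementTree purely as a stand-in for Entrez.read(), so that it runs without Biopython. The input IDs and the linked IDs come out as two independent lists, with nothing in the document tying one to the other.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the eLinkResult shown above (XML declaration and DOCTYPE omitted).
RAW_XML = """<eLinkResult>
<LinkSet>
    <DbFrom>protein</DbFrom>
    <IdList>
        <Id>148908191</Id>
        <Id>297793721</Id>
        <Id>48525513</Id>
        <Id>507118461</Id>
    </IdList>
    <LinkSetDb>
        <DbTo>taxonomy</DbTo>
        <LinkName>protein_taxonomy</LinkName>
        <Link><Id>211604</Id></Link>
        <Link><Id>81972</Id></Link>
        <Link><Id>32630</Id></Link>
        <Link><Id>3332</Id></Link>
    </LinkSetDb>
</LinkSet>
</eLinkResult>"""

root = ET.fromstring(RAW_XML)
link_set = root.find("LinkSet")

# The queried UIDs and the linked UIDs live in separate sibling elements.
input_ids = [node.text for node in link_set.findall("IdList/Id")]
linked_ids = [node.text for node in link_set.findall("LinkSetDb/Link/Id")]

print(input_ids)
print(linked_ids)
```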

Update: Issue filed with Biopython elink URL construction, https://github.com/biopython/biopython/issues/361

Ram · Answer 3 · 2014-11-24

Looking into this recently while sorting out an approach to using ELink, I found that, as Peter said, the result is down to NCBI.

If you had not tried to play nice and use the Entrez History Server, it would have worked.

If you look at this information under ELink Considerations you'll see that trying to use Webenv and a query_key from the Entrez History server causes them to be returned "as a group without information about which nucleotide record is linked to which protein record."

If you just skip the EPost step and send your list to ELink, it will work (as Peter discusses here and demos here).

Here is how you can keep the 1:1 correspondence:

from Bio import Entrez
Entrez.email = "A.N.Other@example.com"     # Always tell NCBI who you are. PUT YOUR EMAIL THERE.
protein_gi_numbers = ["148908191", "297793721", "48525513", "507118461"]
taxonomy_uids = []

# ELink step: pass the IDs directly, skipping EPost and the history server
handle = Entrez.elink(dbfrom="protein", db="taxonomy", id=protein_gi_numbers)
result = Entrez.read(handle)
handle.close()

# Mine the results: take the first linked taxonomy ID from each record
for each_record in result:
    taxonomy_id = each_record["LinkSetDb"][0]["Link"][0]["Id"]
    taxonomy_uids.append(taxonomy_id)

# Report
# print(result)
print(taxonomy_uids)

Result:

['3332', '81972', '211604', '32630']

(You can see the code above run live in a fully interactive in-browser IPython console window here.)
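If you would rather keep an explicit protein-to-taxonomy mapping than rely on list positions, something along these lines may help. It is only a sketch: `map_links` is a made-up helper, and the sample records are hand-copied (without the unicode prefixes) from the per-ID output in the question rather than fetched live. Unlike indexing with [0], it returns an empty list for a record that has no links instead of raising an IndexError.

```python
def map_links(records):
    """Map each input UID to the list of UIDs it links to."""
    mapping = {}
    for record in records:
        # Collect every linked ID; LinkSetDb may be empty when nothing links.
        linked = [link["Id"]
                  for link_set in record["LinkSetDb"]
                  for link in link_set["Link"]]
        for uid in record["IdList"]:
            mapping[uid] = linked
    return mapping


def sample_record(protein_id, taxonomy_id):
    """Build a record shaped like the parsed ELink output in the question."""
    return {
        "DbFrom": "protein",
        "IdList": [protein_id],
        "LinkSetDb": [{"DbTo": "taxonomy",
                       "LinkName": "protein_taxonomy",
                       "Link": [{"Id": taxonomy_id}]}],
    }


records = [
    sample_record("148908191", "3332"),
    sample_record("297793721", "81972"),
    sample_record("48525513", "211604"),
    sample_record("507118461", "32630"),
]

mapping = map_links(records)
for protein_id, taxonomy_ids in mapping.items():
    print(protein_id, taxonomy_ids)
```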

My understanding from the Biopython Tutorial and Cookbook's section on the Entrez Guidelines is that Biopython enforces a limit of no more than three requests per second. However, if you were going to use ELink on over 100 UIDs, you would have to do it outside of peak times. I assume each record in the list (called 'protein_gi_numbers' here) actually counts as an individual request? Maybe Peter can comment on this?