I am trying to run some commands in Python using the urllib library in order to access the API of the UniProt protein database. All the code does is send in UniProt IDs and get back the corresponding gene names (e.g. P08603 corresponds to the gene name CFH). The code has been working fine for all my other projects, but I am having difficulty with one new input, even though it has the same format as the inputs from previous projects.
import re
import pandas as pd
import time
from urllib.parse import urlparse, parse_qs, urlencode
import requests
from requests.adapters import HTTPAdapter, Retry
POLLING_INTERVAL = 3
API_URL = "https://rest.uniprot.org"
retries = Retry(total=5, backoff_factor=0.25, status_forcelist=[500, 502, 503, 504])
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retries))

def submit_id_mapping(from_db, to_db, ids):
    # Submit the mapping job; the IDs are sent as one comma-separated string.
    r = requests.post(
        f"{API_URL}/idmapping/run",
        data={"from": from_db, "to": to_db, "ids": ",".join(ids)},
    )
    r.raise_for_status()
    print(r)
    return r.json()["jobId"]


def get_id_mapping_results_link(job_id):
    url = f"{API_URL}/idmapping/details/{job_id}"
    r = session.get(url)
    r.raise_for_status()
    return r.json()["redirectURL"]


def check_id_mapping_results_ready(job_id):
    # Poll the status endpoint until the job is no longer running.
    while True:
        r = session.get(f"{API_URL}/idmapping/status/{job_id}")
        r.raise_for_status()
        j = r.json()
        if "jobStatus" in j:
            if j["jobStatus"] == "RUNNING":
                print(f"Retrying in {POLLING_INTERVAL}s")
                time.sleep(POLLING_INTERVAL)
            else:
                # Read the status from the parsed JSON, not the Response object.
                raise Exception(j["jobStatus"])
        else:
            return bool(j["results"] or j["failedIds"])


def combine_batches(all_results, batch_results, file_format):
    if file_format == "json":
        for key in ("results", "failedIds"):
            # A page may omit a key entirely, so avoid indexing it directly.
            if batch_results.get(key):
                all_results.setdefault(key, []).extend(batch_results[key])
    else:
        return all_results + batch_results
    return all_results


def decode_results(response, file_format):
    if file_format == "json":
        return response.json()
    elif file_format == "tsv":
        return [line for line in response.text.split("\n") if line]
    return response.text


def get_id_mapping_results_search(url):
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    file_format = query["format"][0] if "format" in query else "json"
    if "size" in query:
        size = int(query["size"][0])
    else:
        size = 500
        query["size"] = size
    parsed = parsed._replace(query=urlencode(query, doseq=True))
    url = parsed.geturl()
    r = session.get(url)
    r.raise_for_status()
    results = decode_results(r, file_format)
    total = int(r.headers["x-total-results"])
    # NOTE: get_batch() and print_progress_batches() are called here
    # but are not defined in this listing.
    print_progress_batches(0, size, total)
    for i, batch in enumerate(get_batch(r, file_format)):
        results = combine_batches(results, batch, file_format)
        print_progress_batches(i + 1, size, total)
    return results


def map_uniprot_identifiers(list_ids, from_id='UniProtKB_AC-ID', to_id='Gene_Name'):
    mapping_dict = {}
    try:
        job_id = submit_id_mapping(from_db=from_id, to_db=to_id, ids=list_ids)
        print(job_id)
        if check_id_mapping_results_ready(job_id):
            link = get_id_mapping_results_link(job_id)
            print(link)
            results = get_id_mapping_results_search(link)
            results = pd.DataFrame(results['results'])
            print(results)
            mapping_dict = dict(zip(results['from'], results['to']))
            print(mapping_dict)
    except Exception as err:
        # Any failure inside the try block ends up here and is only printed.
        print("the error is:")
        print(err)
    return mapping_dict

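The listing calls get_batch() and print_progress_batches() without defining them. What follows is only a minimal sketch of what such helpers might look like, assuming that the results endpoint announces further pages through a Link response header of the form <URL>; rel="next". To keep the sketch self-contained, get_batch() here takes two extra arguments, fetch and decode, which are an assumption of this sketch and not part of the posted code.

```python
import re

# Matches the URL inside a header such as: <https://...>; rel="next"
NEXT_LINK_RE = re.compile(r'<(.+)>; rel="next"')


def get_next_link(headers):
    """Return the next-page URL from a Link header, or None on the last page."""
    match = NEXT_LINK_RE.match(headers.get("Link", ""))
    return match.group(1) if match else None


def get_batch(batch_response, file_format, fetch, decode):
    """Yield the decoded results of every page after batch_response.

    fetch and decode are injected (an assumption of this sketch) so that the
    function does not depend on a live session.
    """
    batch_url = get_next_link(batch_response.headers)
    while batch_url:
        batch_response = fetch(batch_url)
        batch_response.raise_for_status()
        yield decode(batch_response, file_format)
        batch_url = get_next_link(batch_response.headers)


def print_progress_batches(batch_index, size, total):
    """Print how many results have been fetched so far."""
    n_fetched = min((batch_index + 1) * size, total)
    print(f"Fetched: {n_fetched} / {total}")
```

With the listing above, the call inside get_id_mapping_results_search() would then become get_batch(r, file_format, session.get, decode_results).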
The first function, submit_id_mapping(), works, and the line print(job_id) yields a URL that can be pasted into a browser to bring up a page with all the necessary information (e.g. https://rest.uniprot.org/idmapping/results/afcfe615e6a85f09c95a8734b708abca1cce78ce). However, the dataframe built from the output of get_id_mapping_results_search() is completely empty, so the function returns an empty mapping_dict. Given that the relevant information is clearly available on the site, I don't know why this function is not working. I have checked the input, and its formatting is correct. I have tried further debugging inside get_id_mapping_results_search(), but I can't find where the problem occurs. I have also applied the same debugging to a different input that worked in the past, but I have still been unable to pinpoint the issue. The only error I keep getting is 'failedIds', which makes no sense, since the UniProt IDs in my input do have gene names in UniProt (the URL above is proof of this). So why can't I map the UniProt IDs to their gene names?
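One thing that may be worth checking, offered as a guess and not a diagnosis: printing a KeyError shows only the missing key in quotes, so a bare 'failedIds' in the output could be a KeyError raised by a direct dictionary lookup on a page of results that carries no "failedIds" entry. The sketch below illustrates this under that assumption. The names combine_json_batches and describe_error are hypothetical, and the two pages are hand-written stand-ins, not real API output.

```python
def combine_json_batches(all_results, batch_results):
    """Merge one JSON page into the accumulated results, tolerating missing keys."""
    for key in ("results", "failedIds"):
        items = batch_results.get(key)
        if items:
            all_results.setdefault(key, []).extend(items)
    return all_results


def describe_error(err):
    """Return the exception type together with its message."""
    return f"{type(err).__name__}: {err}"


# Hand-written stand-ins for two pages; neither has a "failedIds" key (assumption).
first_page = {"results": [{"from": "P08603", "to": "CFH"}]}
next_page = {"results": [{"from": "P01024", "to": "C3"}]}

try:
    next_page["failedIds"]  # what an unguarded lookup would do
except KeyError as err:
    plain = str(err)                # what print(err) would display
    detailed = describe_error(err)  # includes the exception type

merged = combine_json_batches(first_page, next_page)
print(plain)
print(detailed)
print(len(merged["results"]))
```

Printing the exception type (or the full traceback via traceback.print_exc()) inside the except block makes it easier to tell a Python error apart from a message sent by the server.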
Is this your code exactly?
- return mapping_dict in what you've shared.
- results = pd.DataFrame but don't import pandas.
- check_id_mapping_results_ready is not defined
- get_id_mapping_results_link is not defined
- decode_results is not defined
Provide an example of how you use your functions, as a minimal working example (MWE), ideally one that reproduces the issue.
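As a starting point for such an example, here is a minimal sketch that stops short of contacting UniProt and only builds the form fields that submit_id_mapping() would post. The helper build_form_data is a hypothetical name introduced for illustration. It also shows one input pitfall that is easy to rule out: passing a single string where a list is expected makes ",".join() split the accession into characters.

```python
def build_form_data(from_db, to_db, ids):
    """Build the form fields posted to the /idmapping/run endpoint."""
    if isinstance(ids, str):
        # A bare string would be joined character by character; wrap it instead.
        ids = [ids]
    # Drop surrounding whitespace and empty entries before joining.
    cleaned = [str(i).strip() for i in ids if str(i).strip()]
    return {"from": from_db, "to": to_db, "ids": ",".join(cleaned)}


example_ids = ["P08603"]  # the accession quoted in the question
form = build_form_data("UniProtKB_AC-ID", "Gene_Name", example_ids)
print(form["ids"])

# The pitfall: joining a string iterates over its characters.
pitfall = ",".join("P08603")
print(pitfall)
```

Printing the exact "ids" string that is sent, next to the call that produced it, is usually enough for someone else to reproduce the problem.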
Be up front and clear about the source of the code if you didn't draft it. Note that the first sentence of this post points to source code very similar to yours.
Example of MWE:
Code:
How to use the code
Bring up a temporary Jupyter session in your browser, by clicking here.
Paste in the code above in a new notebook.
After running the code above, run the following:
To see the dataframe, run df in a cell.
To see the returned dictionary, run the_returned_dict in a cell.
Easier alternative: use Unipressed, the Python package for querying UniProt's new REST API
Alternative option, using Unipressed, fully described at the top of this post:
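A rough sketch of what this could look like, assuming the unipressed package is installed and that its IdMappingClient.submit() and each_result() behave as described in the package's README. The function names fetch_gene_names and mappings_to_dict are hypothetical, and the fixed one-second wait is an assumption that may be too short for large jobs.

```python
import time


def fetch_gene_names(ids, wait_seconds=1.0):
    """Submit an ID-mapping job through Unipressed and return the raw mappings."""
    # Imported here so the helper below can be used without the package installed.
    from unipressed import IdMappingClient

    request = IdMappingClient.submit(
        source="UniProtKB_AC-ID", dest="Gene_Name", ids=set(ids)
    )
    # Simple fixed wait (assumption); larger jobs may need longer.
    time.sleep(wait_seconds)
    return list(request.each_result())


def mappings_to_dict(results_list):
    """Turn [{'from': ..., 'to': ...}, ...] into a {from: to} dictionary."""
    return {row["from"]: row["to"] for row in results_list}


# Usage (contacts UniProt, so it is left commented out here):
# results_list = fetch_gene_names(["P08603"])
# gene_names = mappings_to_dict(results_list)
```

With pandas imported as pd, results_df = pd.DataFrame(results_list) would give the dataframe.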
To see the dataframe, run results_df in a cell.
To see the raw list of id_mappings, run results_list in a cell.
I have updated my code to fix the indentation errors and to post the missing code. I spent most of my time debugging the first code posted, so I forgot to post the other functions mentioned. But I will try out Unipressed tomorrow.
Please do not delete the post once you get an answer; it is considered a misuse of the site. The person who answered the question would not have done so had they known you would delete the post.