I'm making Ensembl requests via two methods in my script: 1) using the biomart Python package to query 200 IDs at a time for peptide-to-gene ID conversion, and 2) using the REST API to get Ensembl homology information for each ID individually.
Earlier today, my requests were going through pretty quickly. Now they take ages and hang for what seems like forever. I've fiddled with the code a little since then, but I can't see anything I did that would've had this effect - is this an Ensembl problem on the server side? Can I fix it somehow?
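For what it's worth, here's a minimal way I can sanity-check the server side: hit the REST service's ping endpoint. This assumes the /info/ping endpoint from the REST documentation, and the 10-second timeout is an arbitrary value of mine:

import requests

#Sanity check: Ensembl's REST API documents an /info/ping endpoint that
#should return {"ping": 1} almost immediately when the service is healthy.
ping_url = "https://rest.ensembl.org/info/ping"
try:
    r = requests.get(ping_url, headers = {"Content-Type":"application/json"}, timeout = 10)
    print(r.status_code, r.json())
except requests.exceptions.Timeout:
    print("Ping timed out - the REST service itself seems slow or unreachable")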
This is the first excerpt of the code, which converts Ensembl peptide IDs to their matching gene IDs. It was very fast and is now very slow:
import numpy as np
import pandas as pd
import time
import requests
from biomart import BiomartServer
#data_df is a dataframe obtained elsewhere; it contains gene IDs and other information
def dataset_search_chunks(id_list, id_type, dataset, chunk_size, attributes = []):
    #Queries the BioMart dataset in chunks of chunk_size IDs and collects every returned row
    chunk_start_index = 0
    no_more_chunks = False
    collected_response_df = pd.DataFrame(columns = attributes)
    while not no_more_chunks:
        chunk_end_index = chunk_start_index + chunk_size
        if chunk_end_index > len(id_list):
            id_chunk = id_list[chunk_start_index:]  #final, possibly short, chunk
            no_more_chunks = True
        else:
            id_chunk = id_list[chunk_start_index:chunk_end_index]
        if len(id_chunk) == 0:
            break
        filters_dict = {id_type:id_chunk}
        chunk_response = dataset.search({"filters":filters_dict, "attributes":attributes})
        for line in chunk_response.iter_lines():
            #each line is one tab-separated record of the requested attributes
            line = line.decode().split("\t")
            collected_response_df.loc[len(collected_response_df)] = line
        chunk_start_index += chunk_size
    return collected_response_df
source_biomart_host = "http://www.ensembl.org/biomart"
source_biomart_server = BiomartServer(source_biomart_host)
dataset_name = "hsapiens_gene_ensembl"
source_biomart_dataset = source_biomart_server.datasets[dataset_name]
source_id_list = data_df[ensembl_peptide_col].tolist()
source_gene_df = dataset_search_chunks(id_list = source_id_list, id_type = "ensembl_peptide_id", dataset = source_biomart_dataset,
chunk_size = 200, attributes = ["ensembl_peptide_id", "ensembl_gene_id"])
#Do stuff with source_gene_df
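To pin down where the time goes, I could time each BioMart chunk individually. A rough diagnostic sketch - timed_chunk_search is a hypothetical helper of mine, not part of the script:

#Diagnostic sketch: time each BioMart chunk query separately to see whether
#the slowdown is uniform or concentrated in particular chunks.
def timed_chunk_search(dataset, id_type, id_chunk, attributes):
    start = time.time()
    response = dataset.search({"filters":{id_type:id_chunk}, "attributes":attributes})
    rows = [line.decode().split("\t") for line in response.iter_lines()]
    print("chunk of %d IDs took %.1f s and returned %d rows" % (len(id_chunk), time.time() - start, len(rows)))
    return rows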
Here is the second excerpt, where I query the Ensembl REST API for homology information. It has also become extremely slow - it used to take about 1 second per ID, and now each one takes on the order of several minutes.
gene_id_list = data_df[ensembl_gene_col].tolist()
gene_id_list = list(dict.fromkeys(gene_id_list))  #removes duplicates while preserving order
rest_server = "https://rest.ensembl.org"
rest_ext = "/homology/id/"
rest_suffix = "?"
gene_homologies_dict = {}
for gene_id in gene_id_list:
    if gene_id != "None":  #skip IDs that were stored as the string "None"
        query_url = rest_server + rest_ext + gene_id + rest_suffix
        response = requests.get(query_url, headers = {"Content-Type":"application/json"})
        if not response.ok:
            response.raise_for_status()
        decoded = response.json()
        decoded_data = decoded.get("data", [])  #guard against a missing "data" key
        if len(decoded_data) == 0:
            homologies = []
        elif len(decoded_data) == 1:
            homologies = decoded_data[0].get("homologies")
        else:
            raise Exception("List containing decoded data must not be longer than 1")
        gene_homologies_dict[gene_id] = homologies
        time.sleep(0.2)  #My hope was to prevent throttling by adding a delay... no luck
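If throttling is the issue, the direction I'm considering is a request timeout plus explicit handling of HTTP 429, which (as I understand Ensembl's REST rate-limit documentation) arrives with a Retry-After header. A sketch only, reusing the variables from the excerpt above - fetch_homology and the retry counts are my own invention:

#Sketch of a more defensive request: a timeout so a hung connection fails fast,
#and a retry that honors the Retry-After header if the server answers 429.
def fetch_homology(gene_id, max_retries = 3):
    url = rest_server + rest_ext + gene_id + rest_suffix
    for attempt in range(max_retries):
        response = requests.get(url, headers = {"Content-Type":"application/json"}, timeout = 30)
        if response.status_code == 429:
            time.sleep(float(response.headers.get("Retry-After", 1)))
            continue
        response.raise_for_status()
        return response.json()
    raise Exception("Gave up on %s after %d attempts" % (gene_id, max_retries))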
The requests.get() method gives me status code 200 (OK) when I check response.status_code, and it slowly iterates over my list of IDs for about 25 items... and then on the 26th item it gives the following. I've included a few previous lines for reference.
That ID does have homologs available: https://usa.ensembl.org/Homo_sapiens/Gene/Compara_Ortholog?db=core;g=ENSG00000185104;r=1:50437028-50960267
That link shows error 503: "Service Unavailable" when I click it... also, here's a strange thing: I just tried from another internet connection in another building (i.e. a new IP address too), and there was no change. What does that mean?
Perhaps there is an error in the compara database for that entry. Can you remove that ID and see if your job proceeds?
No need, that can't be the problem because it doesn't always die at that particular ID - sometimes it dies after only a couple! It never gets past about 35 though, and each one is very, very slow...
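Since each request is so slow even from a fresh start, I'm going to log the rate-limit headers on the responses to check whether I'm being throttled. Another diagnostic sketch - the X-RateLimit-* header names are taken from Ensembl's REST rate-limit documentation, so treat them as assumptions:

#Diagnostic sketch: print the rate-limit headers the REST server attaches to
#responses (if a header is absent, .get() simply prints None).
response = requests.get("https://rest.ensembl.org/info/ping",
                        headers = {"Content-Type":"application/json"}, timeout = 30)
for header in ["X-RateLimit-Limit", "X-RateLimit-Remaining",
               "X-RateLimit-Reset", "X-RateLimit-Period", "Retry-After"]:
    print(header, ":", response.headers.get(header))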