Question

My Ensembl requests are suddenly very slow - why?

2.7 years ago

ngarber ▴ 60

I'm making Ensembl requests via two methods in my script: 1) using the biomart Python package to query 200 IDs at a time for gene symbol conversion, and 2) using the REST API to get Ensembl Homology information from each ID individually.

Earlier today, my requests were going through quickly. Now they take ages or hang for what seems like forever. I've changed the code a little since then, but I can't see anything I did that would have had this effect. Is this a problem on Ensembl's server side? Can I fix it somehow?

This is the first excerpt of the code, which converts Ensembl peptide IDs to their matching gene IDs. It was very fast and is now very slow:

import numpy as np
import pandas as pd
import time
import requests
from biomart import BiomartServer

#data_df is a dataframe obtained elsewhere; it contains gene IDs and other information

def dataset_search_chunks(id_list, id_type, dataset, chunk_size, attributes = []): 
    chunk_start_index = 0
    no_more_chunks = False
    collected_response_df = pd.DataFrame(columns = attributes)

    while not no_more_chunks: 
        chunk_end_index = chunk_start_index + chunk_size
        if chunk_end_index > len(id_list): #use the function's own argument, not the global source_id_list
            id_chunk = id_list[chunk_start_index:]
            no_more_chunks = True
        else: 
            id_chunk = id_list[chunk_start_index:chunk_end_index]

        if len(id_chunk) == 0: 
            break

        filters_dict = {id_type:id_chunk}
        chunk_response = dataset.search({"filters":filters_dict, "attributes":attributes})

        for i, line in enumerate(chunk_response.iter_lines()): 
            line = line.decode().split("\t")
            collected_response_df.loc[len(collected_response_df)] = line

        chunk_start_index += chunk_size
    return collected_response_df

source_biomart_host = "http://www.ensembl.org/biomart"
source_biomart_server = BiomartServer(source_biomart_host)
dataset_name = "hsapiens_gene_ensembl"
source_biomart_dataset = source_biomart_server.datasets[dataset_name]
source_id_list = data_df[ensembl_peptide_col].tolist()

source_gene_df = dataset_search_chunks(id_list = source_id_list, id_type = "ensembl_peptide_id", dataset = source_biomart_dataset, 
        chunk_size = 200, attributes = ["ensembl_peptide_id", "ensembl_gene_id"])
#Do stuff with source_gene_df
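
As a side note on the chunked query above, here is a minimal sketch of the same "200 IDs at a time" idea written as a generator, plus a helper that parses the tab-separated response lines. The names iter_chunks and parse_tsv_lines and the example IDs are made up for illustration; this is not part of the biomart package, and it does not address the server-side slowness.

```python
from typing import Iterable, Iterator, List, Sequence


def iter_chunks(ids: Sequence[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids`, each at most `chunk_size` long."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(ids), chunk_size):
        yield list(ids[start:start + chunk_size])


def parse_tsv_lines(lines: Iterable[bytes]) -> List[List[str]]:
    """Decode tab-separated byte lines into rows, skipping empty lines."""
    return [line.decode().split("\t") for line in lines if line]


# Hypothetical usage with made-up IDs; no network access is involved here.
example_ids = ["ENSP_A", "ENSP_B", "ENSP_C", "ENSP_D", "ENSP_E"]
for chunk in iter_chunks(example_ids, chunk_size=2):
    print(chunk)
```

The rows from every chunk could be collected in one list and passed to pd.DataFrame(rows, columns=attributes) a single time, which avoids growing the DataFrame one row at a time.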

Here is the second excerpt, which queries the Ensembl REST API for homology information. It used to take about 1 second per ID; now each request takes several minutes.

gene_id_list = data_df[ensembl_gene_col].tolist()
gene_id_list = list(dict.fromkeys(gene_id_list)) #removes duplicates

rest_server = "https://rest.ensembl.org"
rest_ext = "/homology/id/"
rest_suffix = "?"

gene_homologies_dict = {}
for i, gene_id in enumerate(gene_id_list):
    if gene_id != "None": 
        query_url = rest_server + rest_ext + gene_id + rest_suffix
        response = requests.get(query_url, headers = {"Content-Type" : "application/json"})
        if not response.ok: 
            response.raise_for_status()
        decoded = response.json()

        decoded_data = decoded.get("data")
        if len(decoded_data) == 0: 
            decoded_data = {}
            homologies = []
        elif len(decoded_data) == 1: 
            decoded_data = decoded_data[0]
            homologies = decoded_data.get("homologies")
        else: 
            raise Exception("List containing decoded data must not be longer than 1")

        gene_homologies_dict[gene_id] = homologies
        time.sleep(0.2) #My hope was to prevent throttling by adding a delay... no luck

requests biomart Python REST Ensembl • 4.2k views

ADD COMMENT • link updated 2.6 years ago by Ben Moore ★ 2.4k • written 2.7 years ago by ngarber ▴ 60

score 0 · Answer 1 · 2022-09-29

2.7 years ago

GenoMax 151k

This question is probably not easily answerable. It is possible that the Ensembl server is simply overloaded at the moment and you are feeling that effect. On the other hand, if you have been hammering the server with requests all day, the monitoring system may have noticed your requests and started slowing them down.

That said, check whether you are getting one of the following HTTP codes (from the Ensembl API course slides):

403 Forbidden
    You are submitting far too many requests and have been temporarily forbidden access to the service. Wait and retry with a maximum of 15 requests per second.

429 Too Many Requests
    You have been rate-limited; wait and retry. The headers X-RateLimit-Reset, X-RateLimit-Limit and X-RateLimit-Remaining will inform you of how long you have until your limit is reset and what that limit was. If you get this response and have not exceeded your limit then check if you have made too many requests per second.

ADD COMMENT • link 2.7 years ago by GenoMax 151k
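
To make the status-code check above concrete, here is a minimal sketch of a helper that decides how long to pause before retrying. It assumes that X-RateLimit-Reset holds a number of seconds, and that a standard Retry-After header may also be present; both are assumptions to verify against the Ensembl rate-limit documentation. The function name seconds_to_wait and the 1-second default are hypothetical.

```python
from typing import Mapping, Optional


def seconds_to_wait(
    status_code: int,
    headers: Mapping[str, str],
    default: float = 1.0,
) -> Optional[float]:
    """Return how long to pause before retrying, or None if no pause is needed.

    Only 403 and 429 are treated as rate-limit responses in this sketch.
    """
    if status_code not in (403, 429):
        return None
    # Assumption: both headers, when present, carry a plain number of seconds.
    for name in ("Retry-After", "X-RateLimit-Reset"):
        value = headers.get(name)
        if value is not None:
            try:
                return max(float(value), 0.0)
            except ValueError:
                continue  # header present but not numeric; try the next one
    return default


# Example with a hand-written header dict (no request is made here).
print(seconds_to_wait(429, {"X-RateLimit-Reset": "12"}))
```

With the requests library, this could be called as seconds_to_wait(response.status_code, response.headers) right after each requests.get(), followed by time.sleep() on any non-None result.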

requests.get() returns status code 200 (OK) when I check response.status_code, and the script slowly works through about 25 IDs from my list. On the 26th ID it fails with the following error. I've included a few of the preceding lines for reference.

Retrieving homology data for ENSG00000262102 (25 of 622)
Request ok. Status code: 200
    ... retrieved! Data length: 1
Retrieving homology data for ENSG00000185104 (26 of 622)
Traceback (most recent call last):
  File "/home/tiltwolf/Documents/GitHub/PACM/Step7_SLiM_Conservation.py", line 208, in <module>
    if not response.ok: 
  File "/usr/lib/python3.10/site-packages/requests/models.py", line 960, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: https://rest.ensembl.org/homology/id/ENSG00000185104

ADD REPLY • link 2.7 years ago by ngarber ▴ 60
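
For intermittent 503 responses like the one in the traceback above, one common client-side pattern is to retry with exponential backoff instead of raising on the first failure. The sketch below is an illustration under stated assumptions, not an Ensembl recommendation: the function name fetch_with_backoff, the set of retryable status codes, and the delay values are all hypothetical. The request itself is passed in as a callable so the retry logic can be tried without network access.

```python
import time
from typing import Callable, Tuple

# Status codes treated as temporary in this sketch (an assumption, not an Ensembl rule).
TRANSIENT_STATUSES = {500, 502, 503, 504}


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, object]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> object:
    """Call `fetch` until it returns status 200 or the attempts run out.

    `fetch` must return a (status_code, payload) tuple. After each transient
    failure the pause doubles: base_delay, 2 * base_delay, 4 * base_delay, ...
    """
    for attempt in range(max_attempts):
        status, payload = fetch()
        if status == 200:
            return payload
        if status not in TRANSIENT_STATUSES:
            raise RuntimeError(f"Non-retryable status {status}")
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"Still failing after {max_attempts} attempts")


# Simulated server: two 503 responses, then success. No real request is made.
responses = iter([(503, None), (503, None), (200, {"data": []})])
print(fetch_with_backoff(lambda: next(responses), sleep=lambda seconds: None))
```

In the homology loop, fetch could be a small function that calls requests.get(query_url, headers=..., timeout=30) and returns (response.status_code, response.json() if response.ok else None). Passing a timeout matters because requests otherwise waits indefinitely for a response; the 30-second value is only a placeholder.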

That ID does have homologs available: https://usa.ensembl.org/Homo_sapiens/Gene/Compara_Ortholog?db=core;g=ENSG00000185104;r=1:50437028-50960267

ADD REPLY • link 2.7 years ago by GenoMax 151k

That link shows error 503: "service unavailable" when I click it. Also, here's a strange thing: I just tried from another internet connection in another building (so a new IP address too), and nothing changed. What does that mean?

ADD REPLY • link 2.7 years ago by ngarber ▴ 60

Perhaps there is an error in the compara database for that entry. Can you remove that and see if your job proceeds?

ADD REPLY • link 2.7 years ago by GenoMax 151k

No need, that can't be the problem because it doesn't always die at that particular ID - sometimes it dies after only a couple! It never gets past about 35 though, and each one is very, very slow...

ADD REPLY • link 2.7 years ago by ngarber ▴ 60

score 0 · Answer 2 · 2022-09-30

2.7 years ago

Ben Moore ★ 2.4k

Hi ngarber,

We have experienced some issues with the Ensembl REST API today due to high load on our servers, which is why you may have experienced slow response times. With large queries on the REST API we suggest adding rate limits to avoid exceeding the allowed number of requests: https://github.com/Ensembl/ensembl-rest/wiki/Rate-Limits

ADD COMMENT • link 2.7 years ago by Ben Moore ★ 2.4k
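
As a rough illustration of the client-side rate limiting suggested above, here is a minimal sketch of a limiter that enforces a minimum interval between requests. The class name MinIntervalLimiter and the value of 10 requests per second are hypothetical; the slides quoted earlier in the thread mention a maximum of 15 per second, and the linked rate-limit page is the authority on the actual limits.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Keep successive calls to wait() at least 1 / max_per_second apart."""

    def __init__(
        self,
        max_per_second: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if max_per_second <= 0:
            raise ValueError("max_per_second must be positive")
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time the previous call was released

    def wait(self) -> float:
        """Sleep only as long as needed; return the number of seconds slept."""
        now = self._clock()
        pause = 0.0
        if self._last is not None:
            pause = max(0.0, self.min_interval - (now - self._last))
            if pause > 0:
                self._sleep(pause)
        self._last = now + pause
        return pause


# Hypothetical usage: call limiter.wait() immediately before each request.
limiter = MinIntervalLimiter(max_per_second=10)
for gene_id in ["ID_1", "ID_2", "ID_3"]:  # placeholder IDs, no request is made
    limiter.wait()
    print("would request", gene_id)
```

Unlike a fixed time.sleep(0.2) after every request, this only pauses for whatever part of the interval has not already elapsed, so slow responses are not delayed further.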

Hi Ben,

Unfortunately, I am getting an internal server error (500), and my batch query stops either at the beginning or partway through. It has made it to the end twice. I am using R. Any suggestions?

Best, Arby

ADD REPLY • link 2.7 years ago by aa9gj ▴ 10

Hi Arby,

We are currently experiencing some issues with the Ensembl REST API due to high loads on our servers, which is why you may have experienced slow response times. We are working to fix this as quickly as possible. Thank you for your patience!

ADD REPLY • link 2.7 years ago by Ben Moore ★ 2.4k

Hi Arby,

The issues affecting the Ensembl REST API have now been resolved.

ADD REPLY • link 2.6 years ago by Ben Moore ★ 2.4k