API for retrieving genes' sequences from genes' symbols in a single call
3.0 years ago
langziv ▴ 70

I asked a similar question before but didn't arrive at a working solution, so I'll try to be more specific this time.
I managed to get the sequences with Ensembl's REST API, but only in two steps, so every list of gene symbols requires two API calls: one call to
https://rest.ensembl.org/lookup/symbol/homo_sapiens
whose response gives me the gene IDs, and another to
https://rest.ensembl.org/sequence/id/
to get the sequence for each ID.
I'm looking for a way to do this in a single call (if that's possible). The solution can be a Linux command line or code in a programming language (I'm using Python).

The answer I was given for my first question was

"You need to use the lookup/symbol endpoint in the Ensembl REST API with the sequence/id endpoint."

I'm not sure I understand it and couldn't get the author's clarification, so eventually I decided to try to get some more answers.

Thanks.

gene-sequence linux API gene-symbol • 2.0k views

That question was answered by someone who worked at Ensembl. The Ensembl REST API appears to require two calls. You may be able to do this in one step with biomaRt, but that requires using R.

I had given you a solution using a single command line (not for Ensembl, since there is no equivalent there) that you seem to have ignored: Getting genes' sequences by querying gene symbols/names
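
For reference, the two-call Ensembl approach mentioned above can be sketched in Python roughly as follows. This is a minimal sketch, not an official client: it assumes the POST forms of the two endpoints (which take a JSON list of symbols or IDs), and the helper names and the batch size of 50 are my own choices, so check the documented POST limits before relying on them. It still makes two kinds of calls, but only one of each per batch rather than one per gene.

```python
import json
import urllib.request

ENSEMBL = "https://rest.ensembl.org"


def chunked(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def post_json(url, payload):
    """POST a JSON body and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def ids_from_lookup(lookup_response):
    """Map each symbol present in a lookup response to its gene ID."""
    return {symbol: record["id"] for symbol, record in lookup_response.items()}


def sequences_for_symbols(symbols, species="homo_sapiens", batch_size=50):
    """Return {symbol: genomic sequence or None} for a list of gene symbols."""
    # Call 1: symbols -> Ensembl gene IDs (batch_size is an assumption).
    symbol_to_id = {}
    for batch in chunked(symbols, batch_size):
        response = post_json(f"{ENSEMBL}/lookup/symbol/{species}",
                             {"symbols": batch})
        symbol_to_id.update(ids_from_lookup(response))

    # Call 2: gene IDs -> genomic sequences (introns included).
    id_to_seq = {}
    for batch in chunked(list(symbol_to_id.values()), batch_size):
        records = post_json(f"{ENSEMBL}/sequence/id?type=genomic",
                            {"ids": batch})
        for record in records:
            id_to_seq[record["id"]] = record["seq"]

    return {symbol: id_to_seq.get(gene_id)
            for symbol, gene_id in symbol_to_id.items()}


if __name__ == "__main__":
    # Example symbols only; replace with your own list.
    for symbol, seq in sequences_for_symbols(["BRCA2", "CDK2"]).items():
        print(symbol, len(seq) if seq else None)
```

Symbols that the lookup does not recognise are simply absent from the result, so compare the returned keys with your input list.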


Thank you for the reply. I'll answer you in the first question's page.

3.0 years ago

Here's a bash one-liner, which can be extended to a list of Ensembl gene IDs (ENSG accessions) or ported to other languages:

$ genome="hg38"
$ ensembl_gene="ENSG00000123374"
$ IFS=$'\n' read -rd '' chrom start stop < <(wget -qO- "https://mygene.info/v3/query?q=ensembl.gene:${ensembl_gene}&fields=exons" | jq -r '(.hits[0].exons[0].chr, .hits[0].exons[0].cdsstart, .hits[0].exons[0].cdsend)'); wget -qO- "https://api.genome.ucsc.edu/getData/sequence?genome=${genome};chrom=chr${chrom};start=${start};end=${stop}" | jq -r .dna;
ATGGAGAACTTCC...CTGA

If you want one single API endpoint to do the same, you may need to contact the Ensembl or UCSC developers and put in a feature request.


Thanks, but I need to get sequences from symbols, not IDs. Also, is this for exons? I'm working with gene sequences at the genomic level (before intron removal).


This queries the CDS start and end of the first transcript returned, i.e. the protein-coding region. You might want a different isoform, however. Take a look at the returned JSON or the mygene.info API documentation for more detail, or consider using a different endpoint.
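
To start from a gene symbol and get the whole gene span (introns included), one option is to ask mygene.info for the `genomic_pos` field instead of `exons`. The Python below is a rough sketch under several assumptions that are mine, not part of the one-liner above: that `genomic_pos` holds 1-based inclusive coordinates on the same assembly as `genome`, that prefixing the chromosome name with `chr` is enough (it would not be for `MT` or patch contigs), and that the first hit is the gene you want. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Translation table for complementing DNA, preserving case.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def ucsc_region(genomic_pos):
    """Convert a mygene.info genomic_pos record to (chrom, start, end).

    Assumption: genomic_pos is 1-based inclusive, while the UCSC API
    expects a 0-based start and a 1-based end.
    """
    chrom = f"chr{genomic_pos['chr']}"
    return chrom, genomic_pos["start"] - 1, genomic_pos["end"]


def get_json(url):
    """GET a URL and return the decoded JSON response."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def gene_sequence(symbol, genome="hg38"):
    """Fetch the genomic sequence of a human gene given its symbol."""
    query = urllib.parse.urlencode(
        {"q": f"symbol:{symbol}", "species": "human", "fields": "genomic_pos"}
    )
    hits = get_json(f"https://mygene.info/v3/query?{query}")["hits"]
    pos = hits[0]["genomic_pos"]
    if isinstance(pos, list):
        # Some genes map to several loci; this sketch just takes the first.
        pos = pos[0]

    chrom, start, end = ucsc_region(pos)
    url = ("https://api.genome.ucsc.edu/getData/sequence"
           f"?genome={genome};chrom={chrom};start={start};end={end}")
    dna = get_json(url)["dna"]

    # UCSC returns the forward strand; flip it for minus-strand genes.
    return reverse_complement(dna) if pos.get("strand") == -1 else dna


if __name__ == "__main__":
    print(len(gene_sequence("CDK2")))
```

This is still two HTTP requests per gene, so it does not replace a true single-endpoint solution; it only removes the need to look up the Ensembl ID first.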
