How To Get Ensembl Id (Gene, Transcript, Protein) Mapping Information?
5
14
Entering edit mode
14.1 years ago
Unode ▴ 180

The goal is to map Ensembl identifiers (ENSG, ENST, ENSP) between each other in a programmatic way.

I've searched in the Ensembl website but I couldn't find straightforward instructions to achieve this.

I've also tried to use the public MySQL server, but I couldn't make sense of the schema and the API documentation is overwhelming and quite Perl centered.

Edit: If possible using the Ensembl.org website only as I would like to avoid having to worry about if I'm using the latest version or not.

ensembl identifiers mapping • 90k views
ADD COMMENT
4
Entering edit mode

Just to point out that the Ensembl site has a BioMart interface - http://www.ensembl.org/biomart/martview - and recommends it for data mining. You don't need to worry about versions; BioMart accesses the latest.

ADD REPLY
1
Entering edit mode

This question exposes the downsides of the complexity of the Ensembl schema. Ironically, it is easier to achieve this result via SQL from UCSC (see Pierre's solution) than from Ensembl (see Fred's solution) and justifies the existence of BioMart

ADD REPLY
1
Entering edit mode

On their FTP site, Ensembl.org provides tables (make sure to select TSV format in the "Other annotations" column for easily-parseable text flatfiles) that already have programmatically-mapped ENSG, ENST, and ENSP columns for all species: https://www.ensembl.org/info/data/ftp

This should do what you're looking for.

ADD REPLY
0
Entering edit mode

Ah that's great

ADD REPLY
34
Entering edit mode
14.1 years ago
Neilfws 49k

BioMart is very useful for mapping identifiers. You can use it at the website and there are also libraries for several languages; for example, biomaRt for R Bioconductor.

Here's an example. Say you have the human transcript ENST00000296026 and you want gene (ENSG) and protein (ENSP). You'd do the following at the BioMart website:

  1. Click MARTVIEW (top menu)
  2. Choose ENSEMBL GENES 59 (SANGER UK) for database and Homo sapiens genes GRCh37 for dataset
  3. Click "Filters" (left menu) and expand GENE
  4. Choose "Ensembl Transcript ID(s)" and paste your ID(s) or upload a file of IDs
  5. Click "Attributes" (left menu) and expand GENE
  6. Check Ensembl Gene ID, Transcript ID and Protein ID
  7. Click "Results" (top left menu)

This should return ENSG00000163734, ENST00000296026 and ENSP00000296026. Note that you can export results and map many kinds of IDs.

Here's the same thing using R and biomaRt:

library(biomaRt)
# define biomart object
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# query biomart
results <- getBM(attributes = c("ensembl_gene_id", "ensembl_transcript_id", "ensembl_peptide_id"),
                 filters = "ensembl_transcript_id", values = "ENST00000296026",
                 mart = mart)
results
#   ensembl_gene_id ensembl_transcript_id ensembl_peptide_id
# 1 ENSG00000163734       ENST00000296026    ENSP00000296026
ADD COMMENT
0
Entering edit mode

+1 for a simple solution. However it suffers from the fact that is using release 59 when release 60 is the current version in Ensembl.org. Edited question to reflect this point.

ADD REPLY
0
Entering edit mode

In addition, although this is not a requirement (using python), from a quick search in the web I couldn't find any decent interface to BioMart using Python. I can still use rpy but I would like to prevent the complexity of the task to grow beyond necessary.

ADD REPLY
0
Entering edit mode

Maybe it is too late but there is a python package for biomart

ADD REPLY
0
Entering edit mode

Use PyCogent! It isn't version specific and has a very simple API! I'd say the best available for Python coders!

ADD REPLY
0
Entering edit mode

Also, BioMart has EnsEMBL Genes 60 available!?

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

@gawbul it was updated in the meantime, thanks

ADD REPLY
0
Entering edit mode

What happens if you remove "filters = ensembl_transcript_id"

ADD REPLY
0
Entering edit mode

If you remove filters, you will query the full dataset (all human genes, in the example above). This is not recommended for BioMart as large, genome-wide queries will overload the BioMart servers. For queries of this size, we suggest using the Ensembl REST API or FTP site.

**

ADD REPLY
6
Entering edit mode
14.1 years ago

You can also use the tables ensGene and ensGtp in the UCSC mysql server:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg18

mysql> select * from ensGene as G,ensGtp as T
where G.name=T.transcript  and T.gene="ENSG00000215719" limit 1\G
*************************** 1. row ***************************
         bin: 585
        name: ENST00000369958
       chrom: chr1_random
      strand: -
     txStart: 35366
       txEnd: 37336
    cdsStart: 35366
      cdsEnd: 37336
   exonCount: 3
  exonStarts: 35366,36978,37308,
    exonEnds: 35537,37076,37336,
       score: 0
       name2: ENSG00000215719
cdsStartStat: cmpl
  cdsEndStat: incmpl
  exonFrames: 0,1,0,
        gene: ENSG00000215719
  transcript: ENST00000369958
     protein: ENSP00000358974
1 row in set (0.20 sec)
ADD COMMENT
6
Entering edit mode
9.6 years ago
Reece ▴ 310

Perhaps this Python solution will be useful to someone:

from biomart import BiomartServer

atts = ['external_gene_name','external_gene_source','ensembl_gene_id',
        'ensembl_transcript_id','ensembl_peptide_id']

server = BiomartServer("http://www.biomart.org/biomart")
hge = server.datasets['hsapiens_gene_ensembl']

print(server.databases['ensembl'])
s = hge.search({'attributes': atts}, header=1)
for l in s.iter_lines():
    print(l)

Gives:

ENSEMBL GENES 79 (SANGER UK)
Associated Gene Name    Associated Gene Source  Ensembl Gene ID Ensembl Transcript ID   Ensembl Protein ID
MT-TF   HGNC Symbol ENSG00000210049 ENST00000387314 
MT-RNR1 HGNC Symbol ENSG00000211459 ENST00000389680 
MT-TV   HGNC Symbol ENSG00000210077 ENST00000387342 
MT-RNR2 HGNC Symbol ENSG00000210082 ENST00000387347 
MT-TL1  HGNC Symbol ENSG00000209082 ENST00000386347 
MT-ND1  HGNC Symbol ENSG00000198888 ENST00000361390 ENSP00000354687
ADD COMMENT
4
Entering edit mode
14.1 years ago

You can get a nice explanation of the Ensembl Database Schema at the following url.

http://www.to.infn.it/ftbio/Ensembl_materials/Core/EnsSchema.ppt

Then you can get the information you are looking for using this SQL query : see below in a PHP script.

CODE

<?php
//Connect to the server
if (!$connectionServer = mysql_connect('ensembldb.ensembl.org:3306', 'anonymous', '')) die('Could not connect: ' . mysql_error());

//Connect to the database
if (!$database = mysql_select_db('homo_sapiens_core_47_36i', $connectionServer)) die('Could not select the database');

//Get the Ensembl IDs for Genes, Transcripts, and Proteins
$result = mysql_query("
SELECT DISTINCT 
gsi.stable_id as geneid, 
tsi.stable_id as transcriptid, 
tlsi.stable_id as translationid

FROM 
gene g, 
gene_stable_id gsi, 
transcript t, 
transcript_stable_id tsi, 
translation tl, 
translation_stable_id tlsi

WHERE g.gene_id = gsi.gene_id 
AND g.gene_id = t.gene_id 
AND t.transcript_id = tsi.transcript_id
AND t.transcript_id = tl.transcript_id
AND tl.translation_id = tlsi.translation_id

LIMIT 10");

if (!$result) {
    die('Invalid query: ' . mysql_error());
}
while ($row = mysql_fetch_assoc($result)) {
    print $row['geneid'] . " - " . $row['transcriptid'] . " - " . $row['translationid'] . "<br />";
}

mysql_free_result($result);
mysql_close($connectionServer);
?>

RESULT

ENSG00000146556 - ENST00000326632 - ENSP00000317668
ENSG00000197194 - ENST00000379481 - ENSP00000368794
ENSG00000197490 - ENST00000359752 - ENSP00000352790
ENSG00000215918 - ENST00000401099 - ENSP00000383878
ENSG00000177757 - ENST00000326734 - ENSP00000317958
ENSG00000188405 - ENST00000338633 - ENSP00000342867
ENSG00000187642 - ENST00000341290 - ENSP00000343864
ENSG00000215917 - ENST00000401098 - ENSP00000383877
ENSG00000215916 - ENST00000379325 - ENSP00000368629
ENSG00000205231 - ENST00000379317 - ENSP00000368621
ADD COMMENT
0
Entering edit mode

Your example is useful but the way you constructed the query will cause to only return complete triplets. If the transcript is a pseudogene like the ones corresponding to ENSG00000146556, you won't get them as a result (at least in homo_sapiens_core_60_37e). Thanks for the schema info though.

ADD REPLY
0
Entering edit mode

Thanks Fred, this saved a lot of time for me today.

ADD REPLY
3
Entering edit mode
14.1 years ago

I added a couple of comments, but thought I would clarity with an answer too.

You can use the PyCogent package available here http://pycogent.sourceforge.net/. It has the best API for EnsEMBL access in Python. PyGr has one too, but it doesn't seem to be maintained and only works with version 0.7.

Tutorials are available on EnsEMBL access using PyCogent here http://pycogent.sourceforge.net/examples/query_ensembl.html with cookbook examples here http://pycogent.sourceforge.net/cookbook/accessing_databases.html#ensemble, although the former is better.

BioMart access is available directly from EnsEMBL via the following link http://www.ensembl.org/biomart/martview, which has release 60 access.

Also, an additional (non-Python) library is CGL from the Yandell Lab here http://www.yandell-lab.org/software/cgl.html, written in Perl. It provides easy access to relationships between the different annotation features in a transparent manner.

ADD COMMENT

Login before adding your answer.

Traffic: 1713 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6