Ensembl-Ids Vs. Entrez-Ids
12.8 years ago
Untom ▴ 420

Hi there!

I'm not sure I understand the difference between the Ensembl-Gene-Database and the Entrez-database. I have two datasets that measure gene-expression. One uses Ensembl-IDs to identify the different genes, and the other one uses Entrez-IDs.

I understand that Ensembl and Entrez are both Gene-Databases and use different ID-Schemes. I've also heard that I can use e.g. biomart to convert from one ID to the other. What I was not able to determine was if the mapping was bijective. So here are my questions:

  • Does every Ensembl-Gene ID have a corresponding Entrez-ID? And if so, why weren't the two ever consolidated?

  • If not, what are the differences? Does one database contain more genes than the other? What are the scopes of the different databases?

  • What is the "standard" ID that people use when exchanging data?

  • What should I use in my further data processing? Should I convert the Entrez-IDs to Ensembl-IDs or vice versa?

ensembl entrez gene • 57k views

Just to clarify: Entrez is not a gene database. It's the name of the NCBI infrastructure which provides access to all of the NCBI databases. One of those is the Gene database, so you would say "Entrez Gene".

Done, except for "bijective" => what's the typo there?

This is a very good question. Could you help clarify it a bit by fixing some of the typos: "bijective", "everye", "Ensemble"?

@Untom - I stand corrected, I never had heard this word before. I assumed incorrectly it was a typo like the others. My apologies.

12.8 years ago
Miranda ▴ 340

Hi,

To answer some of your questions:

Unfortunately, there is not necessarily a one-to-one mapping between Entrez Gene and Ensembl Gene IDs, although the situation is improving. As you can read here: http://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi they are working on consolidating them for human and mouse.

The mapping may even differ depending on the database you use for the conversion. For example, following the links from Entrez Gene to Ensembl may give a different mapping than converting with the Ensembl BioMart.
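
To see how much two sources actually disagree, you can load both mappings and compare them per Entrez Gene ID. Below is a minimal Python sketch of that idea. The function names are hypothetical and the IDs are invented placeholders, not real records; in practice you would fill the two lists from whatever exports you have (for instance an NCBI table and a BioMart download).

```python
from collections import defaultdict

def to_lookup(pairs):
    """Collapse (entrez_id, ensembl_id) pairs into {entrez_id: set of ensembl_ids}."""
    lookup = defaultdict(set)
    for entrez_id, ensembl_id in pairs:
        lookup[entrez_id].add(ensembl_id)
    return dict(lookup)

def compare_sources(pairs_a, pairs_b):
    """Report where two mapping sources agree, disagree, or cover different IDs."""
    a, b = to_lookup(pairs_a), to_lookup(pairs_b)
    shared = a.keys() & b.keys()
    return {
        "agree": sorted(k for k in shared if a[k] == b[k]),
        "disagree": sorted(k for k in shared if a[k] != b[k]),
        "only_a": sorted(a.keys() - b.keys()),
        "only_b": sorted(b.keys() - a.keys()),
    }

# Placeholder data, invented for illustration only (not real gene IDs).
ncbi_pairs = [("101", "ENSG_A"), ("102", "ENSG_B"), ("103", "ENSG_C")]
biomart_pairs = [
    ("101", "ENSG_A"),
    ("102", "ENSG_B"),
    ("102", "ENSG_X"),  # second source lists an extra match for 102
    ("104", "ENSG_D"),
]

report = compare_sources(ncbi_pairs, biomart_pairs)
print(report)
```

The "disagree" and "only_*" lists are the ones worth inspecting by hand before you trust either source.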

Personally, I would prefer Entrez Gene IDs, as they are more stable and it is easier to map outdated IDs to current ones. This is much harder for Ensembl Gene IDs.

Both could be considered standards. Another option is the HGNC symbol, which is more commonly used as the name for a gene.

I might have an e-mail from Ensembl or Entrez Gene that explains how they map their IDs to each other.

---------------------------------------------------------------------------------------------------------------------------------->
I asked the following question to Ensembl:
Regarding the external references provided by Ensembl, I was wondering how the references to Entrez Gene are retrieved. Also on basis of the protein sequence?

Answer:
The Ensembl transcript or protein sequence is compared, using BLAST, against Entrez Gene databases. In the case of a nucleotide sequence, it's the Ensembl cDNA that is compared.

I replied:
If I look at the external references for ENSG00000196176 (Homo sapiens), then I get 14 links to EntrezGene.

If I then go to Entrez Gene, then I only see for 8359 (HIST1H4A) the same reference to Ensembl. The other 13 refer to a different Ensembl identifier.

I know all these genes encode for the same protein, but I was assuming the nucleotide sequence is different for all 14 and that therefore ENSG00000196176 would only be linked to 8359 and not the 13 other ones.

The answer I got:
The external references need not be perfect matches. As the HIST1H4 records show a high degree of sequence similarity, the 14 will match to the Ensembl record, however not with a 100% id. These are just "close matches". 8359 is the best match and listed first.



This was back in 2009, but it does explain some of the discrepancies.

On the website of Entrez Gene they state the following for the file gene2ensembl they provide on their FTP site:

This file reports matches between NCBI and Ensembl annotation based on comparison of rna and protein features.

For all organisms, matches are collected as follows. For a protein to be identified as a match between RefSeq and Ensembl, there must be at least 80% overlap between the two. Furthermore, splice site matches must meet certain conditions: either 60% or more of the splice sites must match, or there may be at most one splice site mismatch.

For rna features, the matching criteria are the same as for proteins above. Furthermore, both the rna and the protein features must meet these minimum matching criteria to be considered a good match. In addition, only the best matches will be reported in this file. Other matches that satisfied the matching criteria but were not the best matches will not be reported in this file.

<-----------------------------------------------------------------------------------
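
The gene2ensembl criteria quoted above can be read as a simple rule. Here is a minimal Python sketch of that reading. The function names, and the way overlap and splice-site counts are passed in, are my own assumptions and not NCBI's actual implementation; how the overlap itself is measured is not specified in the quote, and the "only the best match is reported" step is not modelled.

```python
def feature_matches(overlap_fraction, splice_sites_total, splice_sites_matching):
    """Apply the quoted thresholds to one feature (rna or protein).

    overlap_fraction: assumed to be a value between 0 and 1.
    """
    # At least 80% overlap is required.
    if overlap_fraction < 0.80:
        return False
    # Splice sites: at most one mismatch is acceptable...
    mismatches = splice_sites_total - splice_sites_matching
    if mismatches <= 1:
        return True
    # ...otherwise 60% or more of the splice sites must match.
    return splice_sites_matching / splice_sites_total >= 0.60

def is_good_match(rna, protein):
    """Both the rna and the protein feature must pass the same criteria.

    Each argument is a tuple (overlap_fraction, total_sites, matching_sites).
    """
    return feature_matches(*rna) and feature_matches(*protein)

# Hypothetical numbers, for illustration only.
print(is_good_match(rna=(0.95, 10, 9), protein=(0.90, 10, 9)))
print(is_good_match(rna=(0.95, 10, 9), protein=(0.70, 10, 10)))
```

The second call fails only because the protein overlap is below 80%, which shows why a pair can look like a good match on one feature and still be missing from the file.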

Hope it helps,

Gr Miranda

Found it! See original answer. Good luck!

Thanks for the answer, this was really helpful. I would be very interested in said mail, if you manage to dig it up :)

12.8 years ago

Hello from the Ensembl Helpdesk. As Casey pointed out, you can ask specific questions on helpdesk@ensembl.org for direct answers from the team.

I'll first explain a bit about our gene set. The Ensembl protein-coding gene and transcript set is based on the NCBI RefSeq set (manually curated entries only, i.e. NM and NP identifiers, not the predicted XM and XP entries), along with UniProt proteins from Swiss-Prot and TrEMBL. Added to this is manual curation from the Havana group (at the Wellcome Trust Sanger Institute).

For documentation on how the gene set was determined (including ncRNAs), have a look here:

http://www.ensembl.org/info/docs/genebuild/index.html

As for differences between Ensembl and EntrezGene, it was already mentioned in this thread that the CCDS set was constructed to come up with a more unified gene set. Ensembl, UCSC, NCBI and Havana all take part in forming and agreeing on the consensus coding sequences in this set, which currently exists for human and mouse. The latest update, in Sept 2011, shows there are 26,473 CCDS IDs in Human corresponding to 18,471 gene IDs. (CCDS can be splice variants of one gene; ie more than one CCDS can be assigned to a gene).

As for matches between Ensembl and EntrezGene, we know that for the human Ensembl gene set, we have 21,184 links to EntrezGene. We try to get a perfect match when possible. Out of these 21,184 links, 504 genes have more than one EntrezGene entry associated with them. This occurs when we cannot choose a perfect match; ie when we have two good matches, but one does not appear to match with a better percentage than the other. In that case, we assign both matches to the gene/transcript.
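
If you want to find the ambiguous cases in your own mapping table (the equivalent of the 504 genes mentioned above), a small check like the following is enough. This is a rough Python sketch with hypothetical function names and invented placeholder IDs. Note that it only tests whether the listed pairs are one-to-one; a truly bijective mapping would also require every ID in both databases to appear in the table.

```python
from collections import defaultdict

def find_multi_mapped(pairs):
    """Return IDs that take part in more than one link, in either direction.

    pairs: iterable of (ensembl_id, entrez_id) tuples.
    """
    by_ensembl = defaultdict(set)
    by_entrez = defaultdict(set)
    for ensembl_id, entrez_id in pairs:
        by_ensembl[ensembl_id].add(entrez_id)
        by_entrez[entrez_id].add(ensembl_id)
    ensembl_multi = {k: sorted(v) for k, v in by_ensembl.items() if len(v) > 1}
    entrez_multi = {k: sorted(v) for k, v in by_entrez.items() if len(v) > 1}
    return ensembl_multi, entrez_multi

def is_one_to_one(pairs):
    """True if no ID on either side appears in more than one link."""
    ensembl_multi, entrez_multi = find_multi_mapped(pairs)
    return not ensembl_multi and not entrez_multi

# Placeholder links, invented for illustration only (not real gene IDs).
links = [
    ("ENSG_A", "1001"),
    ("ENSG_B", "1002"),
    ("ENSG_B", "1003"),  # one Ensembl gene, two Entrez Gene entries
    ("ENSG_C", "1004"),
    ("ENSG_D", "1004"),  # two Ensembl genes, one Entrez Gene entry
]

ensembl_multi, entrez_multi = find_multi_mapped(links)
print(ensembl_multi)
print(entrez_multi)
print(is_one_to_one(links))
```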

We do match the EntrezGene to the Ensembl Gene ID through the CCDS, if that exists. I am not sure what EntrezGene reports as matches to Ensembl Genes. If you are going through 'LinkOut', those are Ensembl matches that NCBI imports from us directly.

I hope this helps. Feel free to ask more questions either on this thread, or on Ensembl helpdesk.

12.8 years ago

You would need to pose this question to, or get a response from, the Ensembl help desk to fully answer all of your subquestions, but from recent work trying to link IDs between Entrez and Ensembl, I can help with some of these:

Does every Ensembl-Gene ID have a corresponding Entrez-ID?

No. As of release 56, Ensembl does not provide cross-references in the object_xref table between Entrez Gene ID and Ensembl Gene IDs for 26 of the 50 species in the main Ensembl DB (e.g. cow and chicken).

If not, what are the differences? What are the scopes of the different databases?

Ensembl gene IDs are restricted to the set of species in the Ensembl and Ensembl Genomes databases.

Does one database contain more genes than the other?

Almost certainly; you can do the calculations yourself. However, since the mapping of gene IDs between the two systems is incomplete, it is not possible to say how many genes are the same between the two databases without doing a mapping exercise at the sequence level.

What is the "standard" ID that people use when exchanging data?

There is no standard. Some people prefer one system or the other, but there is no "industry standard".

What should I use in my further data processing? Should I convert the Entrez-IDs to Ensembl-IDs or vice versa?

It depends entirely on your problem and strategy for solving it.
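
If you do decide to convert one dataset's IDs, one conservative option is to keep only the genes with an unambiguous mapping and to record why the others were dropped. The Python sketch below shows that idea for Entrez-to-Ensembl conversion. It is only one possible strategy, the function name is hypothetical, and the IDs and expression values are invented placeholders.

```python
from collections import defaultdict

def convert_ids(expression, pairs):
    """Re-key an expression table from Entrez Gene IDs to Ensembl gene IDs.

    expression: {entrez_id: value}
    pairs: iterable of (entrez_id, ensembl_id) tuples from a mapping source.
    Returns (converted, dropped), where dropped maps entrez_id to a reason.
    """
    forward = defaultdict(set)   # entrez_id -> ensembl_ids
    backward = defaultdict(set)  # ensembl_id -> entrez_ids
    for entrez_id, ensembl_id in pairs:
        forward[entrez_id].add(ensembl_id)
        backward[ensembl_id].add(entrez_id)

    converted, dropped = {}, {}
    for entrez_id, value in expression.items():
        targets = forward.get(entrez_id, set())
        if not targets:
            dropped[entrez_id] = "unmapped"
        elif len(targets) > 1:
            dropped[entrez_id] = "several Ensembl IDs"
        else:
            (ensembl_id,) = targets
            if len(backward[ensembl_id]) > 1:
                dropped[entrez_id] = "Ensembl ID shared"
            else:
                converted[ensembl_id] = value
    return converted, dropped

# Placeholder data, invented for illustration only.
expression = {"1001": 5.2, "1002": 0.7, "1003": 3.1, "1004": 9.9, "1005": 1.0}
pairs = [
    ("1001", "ENSG_A"),
    ("1002", "ENSG_B"),
    ("1002", "ENSG_C"),
    ("1003", "ENSG_D"),
    ("1005", "ENSG_D"),
]

converted, dropped = convert_ids(expression, pairs)
print(converted)
print(dropped)
```

Whether dropping ambiguous genes is acceptable, or whether you would rather pick a best match or aggregate values, depends on the analysis, which is the point made above.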

3.7 years ago
hagenaue ▴ 10

I have found these R packages to be quite helpful for mapping from one type of gene annotation to another:

org.Rn.eg.db: Genome wide annotation for Rat http://bioconductor.org/packages/release/data/annotation/html/org.Rn.eg.db.html

org.Mm.eg.db: Genome wide annotation for Mouse https://bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html

org.Hs.eg.db: Genome wide annotation for Human https://www.bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html
