Question

ENSEMBL IDs 2 Entrez Gene IDs - what to do if no match?

0

7.2 years ago

lech.kaczmarczyk ▴ 50

Hi All, I have gene expression data with ENSEMBL IDs (ENSG00000XXXXXXX). I tried 3 different packages to convert them to Entrez IDs (bitr, biomaRt, AnnotationDbi), but I consistently get no match for about 5-6% of the genes. I would like to do GO and GSEA, but most GO and GSEA tools require gene symbols or Entrez IDs. This problem has been bugging me for a while. How should I handle it? I work with mouse genes.

Here is an example of what I am doing:

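library(AnnotationDbi)
library(org.Mm.eg.db)
# Map Ensembl gene IDs to Entrez IDs, keeping the first match when there are several
MyTargetList$entrez <- mapIds(org.Mm.eg.db,
                       keys=rownames(IP_toptreatRT3),
                       column ="ENTREZID",
                       keytype="ENSEMBL",
                       multiVals="first")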

Or with biomaRt:

library(biomaRt)
ensembl <- useMart(biomart = "ensembl", dataset = "mmusculus_gene_ensembl")
genemap <- getBM( attributes = c("ensembl_gene_id", "entrezgene"),
              mart = ensembl )

and then I use the match function to populate the column.

But there seem to be gaps in the databases:

> head(genemap)
     ensembl_gene_id entrezgene
1 ENSMUSG00000064336         NA
2 ENSMUSG00000064337         NA
3 ENSMUSG00000064338         NA
4 ENSMUSG00000064339         NA
5 ENSMUSG00000064340         NA
6 ENSMUSG00000064341      17716

Cheers, Lech

annotation ensembl entrez biomart • 5.0k views

ADD COMMENT • link updated 7.2 years ago by Emily 24k • written 7.2 years ago by lech.kaczmarczyk ▴ 50

1

There's no way to map all Ensembl IDs to Entrez Gene IDs; the latter is a much smaller dataset than the former.

ADD REPLY • link 7.2 years ago by Devon Ryan 104k

0

It would be useful if you added a few examples for which you can't find a match, so that we can replicate your issue.
In addition, showing the code you used with one of those packages could allow us to spot a mistake.

ADD REPLY • link 7.2 years ago by WouterDeCoster 47k

0

Hi, thanks for the quick reply. I did add the examples. Since I get most of the records, I would assume that some records are simply missing from the database (see the NAs after retrieving the biomaRt annotations).

ADD REPLY • link 7.2 years ago by lech.kaczmarczyk ▴ 50

0

Most of the NAs are mitochondrial genes.

ADD REPLY • link 7.2 years ago by cpad0112 21k

0

All of them are non-protein-coding.

ADD REPLY • link 7.2 years ago by WouterDeCoster 47k

0

Not all, it seems:

> IP_toptreatRT0$entrez <- genemap$entrezgene[match(rownames(IP_toptreatRT0), genemap$ensembl_gene_id)]
> IP_toptreatRT0$biotype <- genemap$gene_biotype[match(rownames(IP_toptreatRT0), genemap$ensembl_gene_id)]
> IP_toptreatRT0[is.na(IP_toptreatRT0$entrez) & IP_toptreatRT0$biotype == "protein_coding",]
                        logFC     AveExpr         t      P.Value    adj.P.Val entrez        biotype
ENSMUSG00000068099  1.7856553  6.96331268 16.472108 1.231735e-15 1.097134e-13     NA protein_coding
ENSMUSG00000089665  2.6199978  2.29635921 14.398467 3.279678e-14 1.960328e-12     NA protein_coding
ENSMUSG00000029632  1.5256810  8.62947339 13.462191 1.632542e-13 7.790043e-12     NA protein_coding
ENSMUSG00000058927  1.5661194  7.71168598 12.056272 2.404491e-12 8.048267e-11     NA protein_coding
ENSMUSG00000103034  1.0460229  9.25763535 11.672475 4.484786e-12 1.396829e-10     NA protein_coding
ENSMUSG00000110358  2.8209744 -0.01374836 10.572951 4.100013e-11 9.610454e-10     NA protein_coding
ENSMUSG00000024571  0.8170688  6.72830264  9.971918 1.462597e-10 2.835528e-09     NA protein_coding
ENSMUSG00000091228  1.0110602  7.00300093  9.890613 1.743265e-10 3.272807e-09     NA protein_coding
ENSMUSG00000110086  1.8677990  1.83456194  9.111705 9.784476e-10 1.481520e-08     NA protein_coding
ENSMUSG00000087403 -1.6533448  5.04195971 -8.730238 2.344445e-09 3.166680e-08     NA protein_coding
ENSMUSG00000021708 -1.3807745  6.35653891 -8.448122 4.529814e-09 5.643085e-08     NA protein_coding

.................

ADD REPLY • link 7.2 years ago by lech.kaczmarczyk ▴ 50

score 0 · Answer 1 · 2017-10-02

It might help to retrieve HGNC names:

#!/usr/bin/env python

import sys
from mygene import MyGeneInfo

mg = MyGeneInfo()

genes = ["ENSMUSG00000064336",
         "ENSMUSG00000064337",
         "ENSMUSG00000064338",
         "ENSMUSG00000064339",
         "ENSMUSG00000064340",
         "ENSMUSG00000064341"]

sys.stdout.write("%s\t%s\t%s\n" % ("ensembl", "hgnc", "entrezgene"))
for gene in genes:
    result = mg.query(gene, fields=["symbol", "entrezgene"], species="mouse", verbose=False)
    for hit in result['hits']:
        # Fill in "NA" when a hit has no symbol or no Entrez Gene ID
        if 'symbol' not in hit:
            hit['symbol'] = "NA"
        if 'entrezgene' not in hit:
            hit['entrezgene'] = "NA"
        sys.stdout.write("%s\t%s\t%s\n" % (gene, hit['symbol'], hit['entrezgene']))

Sample run:

$ ./convert_records.py
ensembl                 hgnc    entrezgene
ENSMUSG00000064336      mt-Tf   NA
ENSMUSG00000064337      mt-Rnr1 NA
ENSMUSG00000064338      mt-Tv   NA
ENSMUSG00000064339      mt-Rnr2 NA
ENSMUSG00000064340      mt-Tl1  NA
ENSMUSG00000064341      ND1     17716

Looking at HGNC names in GeneCards or other resources, for example, may help with searching for Entrez Gene records that may not be available directly through these sources.
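
To see which records need that manual follow-up, the script's output can be split into IDs that have an Entrez Gene ID and IDs that do not. The following is only a minimal sketch: it assumes the tab-separated columns written by the script above (ensembl, hgnc, entrezgene), and the function name split_by_entrez and the two-row sample are hypothetical, made up for illustration.

```python
import csv
import io


def split_by_entrez(table_text):
    """Split tab-separated rows (ensembl, hgnc, entrezgene) into mapped and unmapped IDs.

    Hypothetical helper: assumes the header and column order written by the script above.
    If an Ensembl ID appears on several rows, the last row wins.
    """
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    mapped, unmapped = {}, {}
    for row in reader:
        entrez = (row["entrezgene"] or "").strip()
        if entrez and entrez != "NA":
            mapped[row["ensembl"]] = entrez
        else:
            # Keep the symbol so the gene can be looked up by name later
            unmapped[row["ensembl"]] = row["hgnc"]
    return mapped, unmapped


# Two rows copied from the sample run above, joined with tabs
sample = (
    "ensembl\thgnc\tentrezgene\n"
    "ENSMUSG00000064336\tmt-Tf\tNA\n"
    "ENSMUSG00000064341\tND1\t17716\n"
)

mapped, unmapped = split_by_entrez(sample)
total = len(mapped) + len(unmapped)
print("%d of %d IDs lack an Entrez Gene ID" % (len(unmapped), total))
for ensembl_id, symbol in unmapped.items():
    print("%s\t%s" % (ensembl_id, symbol))
```

The unmapped dictionary keeps the symbol next to each Ensembl ID, so the genes without an Entrez Gene ID can be searched by name or simply dropped before running GO or GSEA.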