Question

Mapping Ensembl Gene IDs with dot suffix

10

6.8 years ago

mk ▴ 310

I have a set of bulk mRNA sequencing data downloaded from TCGA. The feature names appear to be Ensembl gene IDs with a suffix. Here is an example:

[995] "ENSG00000236246.1" "ENSG00000281088.1"
[997] "ENSG00000254526.1" "ENSG00000223575.2"
[999] "ENSG00000201444.1" "ENSG00000232573.1"

I am taking the intersection between these features and a set of Entrez Gene IDs. To do this, I am using the biomaRt package to generate a mapping between Ensembl gene IDs and Entrez gene IDs. However, the Ensembl gene IDs in the mapping I get lack the suffixes. Here is the head of the table that maps Entrez gene IDs to Ensembl gene IDs:

  entrezgene ensembl_gene_id
1      90529 ENSG00000001460
2       9235 ENSG00000008517
3      10747 ENSG00000009724
4     654364 ENSG00000011052
5     112611 ENSG00000013392
6      57210 ENSG00000022567

Can someone explain what the Ensembl suffixes mean and how to convert these names to Entrez? If this can be done with biomaRt, it would be ideal. Thanks.

ensembl gene biomart bioconductor R • 17k views

ADD COMMENT • link updated 17 months ago by zx8754 12k • written 6.8 years ago by mk ▴ 310

0

Related post at SO:

ADD REPLY • link 17 months ago by zx8754 12k

8

6.8 years ago

Mike Smith ★ 2.1k

Here's an example of doing the conversion using biomaRt. You can use the versioned IDs you've got, but you'll see it's better to remove the version numbers.

First, we'll load biomaRt and use your example IDs.

library(biomaRt)
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

gene_ids_version <- c("ENSG00000236246.1",
                      "ENSG00000281088.1",
                      "ENSG00000254526.1",
                      "ENSG00000223575.2",
                      "ENSG00000201444.1",
                      "ENSG00000232573.1")

Now we can query BioMart, specifying that we want to use the versioned Ensembl Gene IDs by using the following:

# Note: newer Ensembl releases name this attribute 'entrezgene_id' rather than 'entrezgene'
getBM(attributes = c('ensembl_gene_id_version',
                     'entrezgene'),
      filters = 'ensembl_gene_id_version',
      values = gene_ids_version,
      mart = mart)

> 
  ensembl_gene_id_version entrezgene
1       ENSG00000201444.1         NA
2       ENSG00000223575.2         NA
3       ENSG00000232573.1         NA
4       ENSG00000254526.1         NA

However, notice that we only get 4 results back from our 6 IDs. This is because a query with a versioned ID returns nothing when that version isn't the most recent one, which is not ideal. It's better to do as Emily suggests and strip the version number, so that you query with just the Ensembl gene ID. We'll use the stringr package to do that here:

library(stringr)
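# Escape the dot so it matches a literal "." rather than any character;
# otherwise an ID with no version suffix would be truncated.
gene_ids <- str_replace(gene_ids_version,
                        pattern = "\\.[0-9]+$",
                        replacement = "")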

Now rerun the query with the trimmed IDs and you'll get 5 results this time:

getBM(attributes = c('ensembl_gene_id',
                     'entrezgene'),
      filters = 'ensembl_gene_id', 
      values = gene_ids,
      mart = mart)

>
  ensembl_gene_id entrezgene
1 ENSG00000201444         NA
2 ENSG00000223575         NA
3 ENSG00000232573         NA
4 ENSG00000236246         NA
5 ENSG00000254526         NA

The completely missing entry is because that gene, ENSG00000281088, has been retired from Ensembl, so you'll never get a result. The NA values for the rest are because there's no mapping between Ensembl and Entrez for those genes.
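To see which input IDs came back with no row at all, one option is to compare the query vector against the returned table. This is a minimal sketch in base R. The `bm` data frame is a hand-typed stand-in for the getBM() result shown above (an assumption, so the example runs without a network connection), and `missing_ids` and `full` are illustrative names.

```r
# The six stripped IDs that were sent to BioMart
gene_ids <- c("ENSG00000236246", "ENSG00000281088", "ENSG00000254526",
              "ENSG00000223575", "ENSG00000201444", "ENSG00000232573")

# Stand-in for the getBM() result above: five rows, no Entrez mapping
bm <- data.frame(
  ensembl_gene_id = c("ENSG00000201444", "ENSG00000223575", "ENSG00000232573",
                      "ENSG00000236246", "ENSG00000254526"),
  entrezgene = NA_integer_,
  stringsAsFactors = FALSE
)

# IDs that were queried but are absent from the result (e.g. retired genes)
missing_ids <- setdiff(gene_ids, bm$ensembl_gene_id)

# Left join so every queried ID keeps a row, with NA where nothing was returned
full <- merge(data.frame(ensembl_gene_id = gene_ids, stringsAsFactors = FALSE),
              bm, by = "ensembl_gene_id", all.x = TRUE)
```

With this stand-in, `missing_ids` holds only ENSG00000281088, and `full` has one row per queried ID. An NA in `full` alone does not distinguish "retired" from "no Entrez mapping", which is why the setdiff() step is kept separate.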

Just to check it's really working we'll demonstrate with some IDs that can be mapped.

getBM(attributes = c('ensembl_gene_id',
                     'entrezgene'),
      filters = 'ensembl_gene_id', 
      values = c('ENSG00000001460', 'ENSG00000008517', 'ENSG00000009724'),
      mart = mart)

>
  ensembl_gene_id entrezgene
1 ENSG00000001460      90529
2 ENSG00000008517       9235
3 ENSG00000009724      10747
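Once you have a mapping table like this, you still need to attach the Entrez IDs to the original versioned feature names. A rough sketch with base R's match() follows. The `mapping` data frame copies the three rows above, while the version suffixes in `features` are made up for illustration and are not taken from TCGA.

```r
# Hypothetical versioned feature names (the version numbers are invented)
features <- c("ENSG00000001460.17", "ENSG00000008517.16", "ENSG00000236246.1")

# Stand-in for the getBM() result shown above
mapping <- data.frame(
  ensembl_gene_id = c("ENSG00000001460", "ENSG00000008517", "ENSG00000009724"),
  entrezgene = c(90529, 9235, 10747),
  stringsAsFactors = FALSE
)

# Strip the version, then look each feature up in the mapping table
stripped <- sub("\\.[0-9]+$", "", features)
entrez <- mapping$entrezgene[match(stripped, mapping$ensembl_gene_id)]
names(entrez) <- features

# Features with a usable Entrez ID
mapped_features <- names(entrez)[!is.na(entrez)]
```

match() only takes the first hit, so if one Ensembl gene maps to several Entrez IDs you would need merge() or similar to keep all of them.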

3

6.6 years ago

PavolG ▴ 30

My favorite way to strip the versions, using nth() from dplyr and tstrsplit() from data.table:

dplyr::nth(data.table::tstrsplit(gene_ids_version, split = "\\."), n = 1)

1

6.8 years ago

Bastien Hervé 6.0k

Something like this? In the R console:

data <- c("ENSG00000236246.1","ENSG00000281088.1","ENSG00000254526.1","ENSG00000223575.2","ENSG00000201444.1","ENSG00000232573.1")
data_modified <- sapply(strsplit(data,"\\."), function(x) x[1])

score 18 · Accepted Answer · 2018-03-07

18

6.8 years ago

Emily 24k

The numbers are version numbers. There is information about stable ID versioning here. You can just strip off the version numbers to use with biomaRt.
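If you want to keep the version around rather than throw it away, a small sketch in base R that splits each ID into its stable part and its version might look like this (the variable names are just for illustration):

```r
ids <- c("ENSG00000236246.1", "ENSG00000223575.2")

# Everything before the final ".<digits>" is the stable ID
stable_id <- sub("\\.[0-9]+$", "", ids)

# Everything after the last dot is the version number
version <- as.integer(sub("^.*\\.", "", ids))

id_table <- data.frame(stable_id = stable_id, version = version,
                       stringsAsFactors = FALSE)
```

This assumes every ID actually carries a plain numeric version suffix. IDs without one would need to be handled separately.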

0

Note that you may wind up with duplicates after stripping, e.g. ENSG00000228572.7 and ENSG00000228572.7_PAR_Y. Even with the attribute ensembl_gene_id_version, biomaRt doesn't include these "PAR" IDs.
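One possible way to handle this, sketched in base R under the assumption that the only extra suffix is a literal `_PAR_Y` after the version, is to strip both parts and then flag the duplicates:

```r
ids <- c("ENSG00000228572.7", "ENSG00000228572.7_PAR_Y", "ENSG00000236246.1")

# Remove ".<version>" plus an optional "_PAR_Y" tag at the end of the ID
stable <- sub("\\.[0-9]+(_PAR_Y)?$", "", ids)

# Flag second and later occurrences of the same stable ID
is_dup <- duplicated(stable)

# Keep one entry per stable ID for the biomaRt query
unique_ids <- stable[!is_dup]
```

Whether to drop the PAR_Y rows, or sum their counts with the matching rows, depends on the analysis, so this only shows how to detect them.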

ADD REPLY • link 17 months ago by LayneSadler ▴ 90