Mapping Ensembl Gene IDs with dot suffix
6.9 years ago
mk ▴ 310

I have a set of bulk mRNA sequencing data downloaded from TCGA. The feature names appear to be Ensembl gene IDs with a suffix. Here is an example:

[995] "ENSG00000236246.1" "ENSG00000281088.1"
[997] "ENSG00000254526.1" "ENSG00000223575.2"
[999] "ENSG00000201444.1" "ENSG00000232573.1"

I want to take the intersection between these features and a set of Entrez Gene IDs. To do this, I am using the biomaRt package to generate a mapping between Ensembl gene IDs and Entrez gene IDs. However, the only Ensembl gene IDs I can find in that mapping lack the suffixes. Here is the head of the table that maps Entrez gene IDs to Ensembl gene IDs:

  entrezgene ensembl_gene_id
1      90529 ENSG00000001460
2       9235 ENSG00000008517
3      10747 ENSG00000009724
4     654364 ENSG00000011052
5     112611 ENSG00000013392
6      57210 ENSG00000022567

Can someone explain what the Ensembl suffixes mean and how to convert these names to Entrez? If this can be done with biomaRt, it would be ideal. Thanks.

ensembl gene biomart bioconductor R • 17k views
6.9 years ago
Emily 24k

The suffixes are version numbers; Ensembl's documentation on stable IDs describes the versioning. You can simply strip the version numbers before using the IDs with biomaRt.


Note that you may wind up with duplicates, e.g. ENSG00000228572.7 and ENSG00000228572.7_PAR_Y. Even with the attribute ensembl_gene_id_version, BioMart doesn't include these "PAR" IDs.
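
A minimal base R sketch of one way to deal with this, assuming the extra tag is always the literal _PAR_Y (as in the example above) and that collapsing the PAR copy onto the same stable ID is acceptable for your analysis. The helper name strip_ensembl_version is made up for illustration.

```r
# Hypothetical helper: drop the ".<version>" suffix and an optional "_PAR_Y" tag.
strip_ensembl_version <- function(ids) {
  sub("\\.[0-9]+(_PAR_Y)?$", "", ids)
}

ids <- c("ENSG00000228572.7", "ENSG00000228572.7_PAR_Y", "ENSG00000236246.1")
stripped <- strip_ensembl_version(ids)

# Flag IDs that collapse onto an ID already seen, then keep one copy of each.
is_dup <- duplicated(stripped)
unique_ids <- unique(stripped)
```

If the duplicated features carry separate expression values, you would still need to decide how to combine or drop them; is_dup only tells you which rows are affected.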

6.9 years ago
Mike Smith ★ 2.1k

Here's an example of doing the conversion using biomaRt. You can use the versioned IDs you've got, but you'll see it's better to remove the version numbers.

First, we'll load biomaRt and use your example IDs.

library(biomaRt)
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

gene_ids_version <- c("ENSG00000236246.1",
                      "ENSG00000281088.1",
                      "ENSG00000254526.1",
                      "ENSG00000223575.2",
                      "ENSG00000201444.1",
                      "ENSG00000232573.1")

Now we can query BioMart, specifying that we want to use the versioned Ensembl Gene IDs by using the following:

# Note: newer Ensembl releases name this attribute 'entrezgene_id'
getBM(attributes = c('ensembl_gene_id_version',
                     'entrezgene'),
      filters = 'ensembl_gene_id_version',
      values = gene_ids_version,
      mart = mart)

> 
  ensembl_gene_id_version entrezgene
1       ENSG00000201444.1         NA
2       ENSG00000223575.2         NA
3       ENSG00000232573.1         NA
4       ENSG00000254526.1         NA

However, notice that only 4 results are returned for our 6 IDs. If you query with a versioned ID that isn't the most recent version, BioMart returns no result for it, which is not ideal. It is better to do as Emily suggests and strip the version number to use just the Ensembl gene ID. We'll use the stringr package to do that here:

library(stringr)
# Escape the dot so it matches a literal "." rather than any character
gene_ids <- str_replace(gene_ids_version,
                        pattern = "\\.[0-9]+$",
                        replacement = "")

Now rerun the query with the trimmed IDs and you'll get 5 results this time:

getBM(attributes = c('ensembl_gene_id',
                     'entrezgene'),
      filters = 'ensembl_gene_id', 
      values = gene_ids,
      mart = mart)

>
  ensembl_gene_id entrezgene
1 ENSG00000201444         NA
2 ENSG00000223575         NA
3 ENSG00000232573         NA
4 ENSG00000236246         NA
5 ENSG00000254526         NA

The completely missing entry is because that gene, ENSG00000281088, has been retired from Ensembl, so you'll never get a result. The NA values for the rest are because there's no mapping between Ensembl and Entrez for those genes.
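
To tell apart the IDs that were dropped entirely from those returned with NA, you can compare the input against the result. A minimal sketch, assuming the getBM() output above has been typed in by hand as a data frame called res so that the example runs without querying BioMart:

```r
gene_ids <- c("ENSG00000236246", "ENSG00000281088", "ENSG00000254526",
              "ENSG00000223575", "ENSG00000201444", "ENSG00000232573")

# Stand-in for the getBM() result shown above (hand-copied, not queried live).
res <- data.frame(
  ensembl_gene_id = c("ENSG00000201444", "ENSG00000223575", "ENSG00000232573",
                      "ENSG00000236246", "ENSG00000254526"),
  entrezgene = NA_integer_,
  stringsAsFactors = FALSE
)

# IDs BioMart returned no row for at all (e.g. retired genes).
not_found <- setdiff(gene_ids, res$ensembl_gene_id)

# IDs that are present in Ensembl but have no Entrez mapping.
no_entrez <- res$ensembl_gene_id[is.na(res$entrezgene)]
```

With a live query you would replace the hand-built res with the data frame returned by getBM().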

Just to check it's really working we'll demonstrate with some IDs that can be mapped.

getBM(attributes = c('ensembl_gene_id',
                     'entrezgene'),
      filters = 'ensembl_gene_id', 
      values = c('ENSG00000001460', 'ENSG00000008517', 'ENSG00000009724'),
      mart = mart)

>
  ensembl_gene_id entrezgene
1 ENSG00000001460      90529
2 ENSG00000008517       9235
3 ENSG00000009724      10747
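
Putting it together for the original question, here is a rough sketch of intersecting versioned feature names with a set of Entrez IDs. The mapping rows are the ones shown above; the feature names (including their version suffixes) and the Entrez set are made up for illustration, and map stands in for a real getBM() result.

```r
# Mapping table shaped like a getBM() result (rows copied from the output above).
map <- data.frame(
  ensembl_gene_id = c("ENSG00000001460", "ENSG00000008517", "ENSG00000009724"),
  entrezgene = c(90529L, 9235L, 10747L),
  stringsAsFactors = FALSE
)

# Made-up feature names; the version suffixes are placeholders.
features <- c("ENSG00000001460.1", "ENSG00000009724.1", "ENSG00000281088.1")
# Made-up Entrez gene set of interest.
entrez_set <- c(9235L, 10747L)

# Strip versions, then look up each feature's Entrez ID (NA if unmapped).
stripped <- sub("\\.[0-9]+$", "", features)
feature_entrez <- map$entrezgene[match(stripped, map$ensembl_gene_id)]

# Features whose Entrez ID is in the set of interest.
keep <- features[feature_entrez %in% entrez_set]
```

Note that match() only picks the first row per Ensembl ID, so if a gene maps to several Entrez IDs you may prefer merge() to keep every pairing.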
6.7 years ago
PavolG ▴ 30

My favorite way to strip the versions uses nth() from dplyr and tstrsplit() from data.table:

dplyr::nth(data.table::tstrsplit(gene_ids_version, split = "\\."), n = 1)
6.9 years ago

Something like this? In the R console:

data <- c("ENSG00000236246.1","ENSG00000281088.1","ENSG00000254526.1","ENSG00000223575.2","ENSG00000201444.1","ENSG00000232573.1")
data_modified <- sapply(strsplit(data,"\\."), function(x) x[1])