Question

Mapping Ensembl IDs to Entrez - Merge data frames

0

6.3 years ago

rin ▴ 40

Hi everyone

I am working on a gene expression data set from TCGA, where genes are annotated with Ensembl IDs. I used biomaRt to convert them to Entrez IDs with the following code:

mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))

genes <- getBM(
  filters="ensembl_gene_id_version",
  attributes=c("ensembl_gene_id", "entrezgene"),
  values=genesens,
  mart=mart)

But all I get is a separate table of the mapped IDs, whereas I want to add a column with the Entrez ID next to the corresponding Ensembl ID in my data frame. Any ideas on how I should modify the above code?

Thank you in advance!

EDIT: Note that the Ensembl IDs in the initial data frame have a dot (version) suffix.

biomart RNA-Seq • 9.2k views

ADD COMMENT • link 6.3 years ago by rin ▴ 40

1

It sounds like you need to merge genes with your expression matrix. If the two ID columns differ only by the version suffix, you can strip it from the versioned IDs with gsub("\\.\\d+$", "", x), where x is the column that carries the suffix.
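A minimal sketch of that strip-then-merge idea, using base R only. The two data frames are hypothetical toy stand-ins for the TCGA expression matrix and the getBM() result (the column name sample_a is made up), so treat it as an illustration under those assumptions:

```r
# Hypothetical toy data standing in for the TCGA expression matrix
expr <- data.frame(
  X1 = c("ENSG00000000003.13", "ENSG00000000005.5", "ENSG00000000419.11"),
  sample_a = c(2449, 6, 487),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the getBM() result (IDs without version suffix)
genes <- data.frame(
  ensembl_gene_id = c("ENSG00000000003", "ENSG00000000005"),
  entrezgene = c(7105L, 64102L),
  stringsAsFactors = FALSE
)

# Drop the trailing ".<version>" so both key columns use the same format
expr$ensembl_gene_id <- gsub("\\.\\d+$", "", expr$X1)

# all.x = TRUE keeps every expression row; unmapped genes get NA
merged <- merge(expr, genes, by = "ensembl_gene_id", all.x = TRUE)
print(merged)
```

With all.x = TRUE, genes without an Entrez mapping stay in the table with NA, which makes them easy to inspect or filter later.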

ADD REPLY • link 6.3 years ago by ejm32 ▴ 450

1

rina : You should take a look at @Mike Smith's answer here: A: Mapping Ensembl Gene IDs with dot suffix

ADD REPLY • link 6.3 years ago by GenoMax 147k

0

Some example data from the genesens object would help, @rina.

ADD REPLY • link 6.3 years ago by cpad0112 21k

0

You are right.

Here are some: ENSG00000000003.13 ENSG00000000005.5 ENSG00000000419.11 ENSG00000000457.12

ADD REPLY • link 6.3 years ago by rin ▴ 40

0

With the example IDs and the OP's code, the result is:

> genes
  ensembl_gene_id entrezgene
1 ENSG00000000005      64102

The output Ensembl gene IDs have no suffix. If you would like to merge the two data frames (your data and the results), you can merge them by ensembl_gene_id. If you could post a few lines from each (with a few matching rows), that would be helpful.

If you also want the gene symbol, add 'hgnc_symbol' to the attributes list.

> genes
  ensembl_gene_id entrezgene hgnc_symbol
1 ENSG00000000005      64102        TNMD

ADD REPLY • link 6.3 years ago by cpad0112 21k

0

The data frame's first column has Ensembl IDs such as the following. The rest of the columns are raw expression counts.

 [1] "ENSG00000000005.5"  "ENSG00000000419.11" "ENSG00000000457.12" "ENSG00000000460.15" "ENSG00000000938.11" "ENSG00000000971.14" "ENSG00000001036.12" "ENSG00000001084.9" 
[9] "ENSG00000001167.13"

The results I get after the mapping look like this.

  ensembl_gene_id entrezgene
1 ENSG00000000005      64102
2 ENSG00000001561      22875
3 ENSG00000004478       2288
4 ENSG00000004799       5166
5 ENSG00000005022        292
6 ENSG00000005073       3207

ADD REPLY • link 6.3 years ago by rin ▴ 40

1

Well, there are ways to join the data frames, either with a regex-based (fuzzy) join or with a simple workaround. The workaround (the easy way) comes first. Note: here genes is the list of example Ensembl IDs posted above and genesens is the result from biomaRt.

> head(genes,3)
                  V1
1  ENSG00000000005.5
2 ENSG00000000419.11
3 ENSG00000000457.12
> library(stringr)
> genes$V2 = str_split_fixed(genes$V1, "\\.", 2)[, 1]
> dplyr::left_join(genes, genesens, by = c("V2" = "ensembl_gene_id"))
                  V1              V2 entrezgene
1  ENSG00000000005.5 ENSG00000000005      64102
2 ENSG00000000419.11 ENSG00000000419         NA
3 ENSG00000000457.12 ENSG00000000457         NA
4 ENSG00000000460.15 ENSG00000000460         NA
5 ENSG00000000938.11 ENSG00000000938         NA
6 ENSG00000000971.14 ENSG00000000971         NA
7 ENSG00000001036.12 ENSG00000001036         NA
8  ENSG00000001084.9 ENSG00000001084         NA
9 ENSG00000001167.13 ENSG00000001167         NA

With a regex join from the fuzzyjoin package, it would be:

> library(fuzzyjoin)
> regex_left_join(genes, genesens, by = c("V1" = "ensembl_gene_id"))

                  V1 ensembl_gene_id entrezgene
1  ENSG00000000005.5 ENSG00000000005      64102
2 ENSG00000000419.11            <NA>         NA
3 ENSG00000000457.12            <NA>         NA
4 ENSG00000000460.15            <NA>         NA
5 ENSG00000000938.11            <NA>         NA
6 ENSG00000000971.14            <NA>         NA
7 ENSG00000001036.12            <NA>         NA
8  ENSG00000001084.9            <NA>         NA
9 ENSG00000001167.13            <NA>         NA
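If the goal is only to add an Entrez column in place, as the original question asks, a base R match() lookup is another option. The sketch below reuses the names genes and genesens from this comment, with hypothetical toy values, and assumes that taking the first match per ID is acceptable:

```r
# Toy versions of the two objects used in this comment (values are illustrative)
genes <- data.frame(
  V1 = c("ENSG00000000005.5", "ENSG00000000419.11", "ENSG00000000457.12"),
  stringsAsFactors = FALSE
)
genesens <- data.frame(
  ensembl_gene_id = "ENSG00000000005",
  entrezgene = 64102L,
  stringsAsFactors = FALSE
)

# Strip the version suffix to get a key comparable to ensembl_gene_id
genes$V2 <- sub("\\.\\d+$", "", genes$V1)

# match() returns, for each key, the row of its first hit in genesens (NA if none)
idx <- match(genes$V2, genesens$ensembl_gene_id)

# Indexing with NA yields NA, so unmapped genes get NA automatically
genes$entrezgene <- genesens$entrezgene[idx]
print(genes)
```

Unlike a join, this never changes the number or order of rows, but it silently ignores any second mapping for the same Ensembl ID.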

ADD REPLY • link 6.3 years ago by cpad0112 21k

0

Thank you so much for your help! The entrezgene column is an integer, and it seems left_join can only be used on character columns. Should I just convert it with the toString function? Excuse my very basic question, but I am just starting to work with R.

ADD REPLY • link 6.3 years ago by rin ▴ 40

0

Can you print the data structure of the common columns between the two data frames?

ADD REPLY • link 6.3 years ago by cpad0112 21k

0

Expression matrix columns

                                   X1 TCGA-AA-3815-01A-01R-1022-07 TCGA-NH-A5IV-01A-42R-A37K-07
ENSG00000000003.13 ENSG00000000003.13                         2449                         4369
ENSG00000000005.5   ENSG00000000005.5                            6                           58
ENSG00000000419.11 ENSG00000000419.11                          487                         1168
ENSG00000000457.12 ENSG00000000457.12                          269                         1049
ENSG00000000460.15 ENSG00000000460.15                          177                          533
ENSG00000000938.11 ENSG00000000938.11                          331                          858

that I turned into

"ENSG00000000003"       "ENSG00000000005"        "ENSG00000000419"        "ENSG00000000457"        "ENSG00000000460"        "ENSG00000000938"        "ENSG00000000971"

by using nth(tstrsplit(genes, split ="\\."),n=1)

The biomaRt result is the following data frame

  ensembl_gene_id entrezgene
1 ENSG00000000003       7105
2 ENSG00000000005      64102
3 ENSG00000000419       8813
4 ENSG00000000457      57147
5 ENSG00000000460      55732
6 ENSG00000000938       2268

Every column is "character" except entrezgene, which is an integer.

ADD REPLY • link 6.3 years ago by rin ▴ 40

0

Then your merge is on the ensembl_gene_id column (from the results) and the X1 column from the data matrix. The type of the entrezgene column doesn't affect left_join, because it is not a join key.
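On the toString question: should a conversion ever be needed, as.character() is the element-wise one, whereas toString() collapses the whole vector into a single string. A small base R illustration (the variable names are made up for this sketch):

```r
# Three Entrez IDs taken from the biomaRt result shown earlier in the thread
entrez <- c(7105L, 64102L, 8813L)

# toString() pastes all elements into ONE comma-separated string
as_one_string <- toString(entrez)
print(as_one_string)

# as.character() converts each element separately and keeps the length
as_vector <- as.character(entrez)
print(as_vector)
```

So toString() would be the wrong tool for converting a data frame column, since the result no longer has one value per row.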

ADD REPLY • link 6.3 years ago by cpad0112 21k

0

This is the reason I am confused when I get this message

Error in UseMethod("groups") : 
  no applicable method for 'groups' applied to an object of class "character"

And as the entrezgene column is the only one not being character I assumed this was the problem.

ADD REPLY • link 6.3 years ago by rin ▴ 40

1

Input head:

> head(dat)
                             X1 TCGA.AA.3815.01A.01R.1022.07 TCGA.NH.A5IV.01A.42R.A37K.07
ENSG00000000003 ENSG00000000003                         2449                         4369
ENSG00000000005 ENSG00000000005                            6                           58
ENSG00000000419 ENSG00000000419                          487                         1168
ENSG00000000457 ENSG00000000457                          269                         1049
ENSG00000000460 ENSG00000000460                          177                          533
ENSG00000000938 ENSG00000000938                          331                          858

results head:

> head(results)
  ensembl_gene_id entrezgene
1 ENSG00000000003       7105
2 ENSG00000000005      64102
3 ENSG00000000419       8813
4 ENSG00000000457      57147
5 ENSG00000000460      55732
6 ENSG00000000938       2268

data structure of results (ncbi entries are integers)

> str(results)
'data.frame':   6 obs. of  2 variables:
 $ ensembl_gene_id: chr  "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457" ...
 $ entrezgene     : int  7105 64102 8813 57147 55732 2268

output:

> dplyr::left_join(dat,results,by=c("X1"="ensembl_gene_id"))
               X1 TCGA.AA.3815.01A.01R.1022.07 TCGA.NH.A5IV.01A.42R.A37K.07 entrezgene
1 ENSG00000000003                         2449                         4369       7105
2 ENSG00000000005                            6                           58      64102
3 ENSG00000000419                          487                         1168       8813
4 ENSG00000000457                          269                         1049      57147
5 ENSG00000000460                          177                          533      55732
6 ENSG00000000938                          331                          858       2268

Check whether any loaded package conflicts with dplyr, and also check the structure of the common columns, e.g. str(dat$X1) and str(results$ensembl_gene_id) from the above example. Both must match.
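These checks could look roughly like the following in base R. The toy data frames are hypothetical, and the factor column is introduced on purpose to show a mismatch; find() simply reports which attached packages, if any, provide a function with that name:

```r
# Toy stand-ins; results deliberately stores its key as a factor
dat <- data.frame(
  X1 = c("ENSG00000000003", "ENSG00000000005"),
  stringsAsFactors = FALSE
)
results <- data.frame(
  ensembl_gene_id = factor(c("ENSG00000000003", "ENSG00000000005")),
  entrezgene = c(7105L, 64102L)
)

# Compare the types of the two key columns
print(class(dat$X1))
print(class(results$ensembl_gene_id))

# Coerce the factor key to character so both sides agree
results$ensembl_gene_id <- as.character(results$ensembl_gene_id)
stopifnot(identical(class(dat$X1), class(results$ensembl_gene_id)))

# List attached packages that define left_join (empty if none is attached)
print(find("left_join"))
```

If find() lists more than one package, calling the function with an explicit namespace, as in dplyr::left_join above, avoids the ambiguity.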

ADD REPLY • link 6.3 years ago by cpad0112 21k

0

I was mistakenly passing the Ensembl ID column instead of the whole data frame as an argument to the left_join function. It works just fine now. Thanks for helping!
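For anyone hitting the same error, the difference between a data frame and one of its columns can be checked with is.data.frame(). The toy data frame below is hypothetical (sample_a is a made-up column name):

```r
# Hypothetical two-row expression table
dat <- data.frame(
  X1 = c("ENSG00000000003", "ENSG00000000005"),
  sample_a = c(2449, 6),
  stringsAsFactors = FALSE
)

# The whole table is a data frame and is what a join function expects
print(is.data.frame(dat))

# dat$X1 extracts a plain character vector, not a data frame
print(is.data.frame(dat$X1))

# Single-bracket indexing keeps the data frame structure
print(is.data.frame(dat["X1"]))
```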

ADD REPLY • link 6.3 years ago by rin ▴ 40

0

rin : If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they all work.

Note: I have moved @cpad0112's original comment to an answer to maintain the train of thought.

ADD REPLY • link 4.8 years ago by GenoMax 147k

0

Looking at the NAs that came up after mapping to Entrez, I randomly checked one (ENSG00000018607): it is linked to an Entrez ID, yet that ID was not found by the mapping. Any ideas what the reason might be?

This is the code I used

mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
genes.entrez <- getBM(
  filters="ensembl_gene_id",
  attributes=c("ensembl_gene_id", "entrezgene"),
  values=genes.nodot,
  mart=mart)
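To narrow this down, it may help to separate IDs that are absent from the result altogether from IDs that came back with an NA Entrez value. A base R sketch, reusing the names genes.nodot and genes.entrez with hypothetical toy contents (the two placeholder IDs are invented for illustration and are not real Ensembl IDs):

```r
# Toy query vector; the PLACEHOLDER IDs are made up for this sketch
genes.nodot <- c("ENSG00000000003", "ENSG00000000005",
                 "ENSG_PLACEHOLDER_1", "ENSG_PLACEHOLDER_2")

# Toy stand-in for the getBM() result
genes.entrez <- data.frame(
  ensembl_gene_id = c("ENSG00000000003", "ENSG00000000005", "ENSG_PLACEHOLDER_2"),
  entrezgene = c(7105L, 64102L, NA),
  stringsAsFactors = FALSE
)

# Case 1: queried IDs that do not appear in the result at all
not_returned <- setdiff(genes.nodot, genes.entrez$ensembl_gene_id)

# Case 2: IDs that were returned, but with no Entrez ID attached
returned_without_entrez <-
  genes.entrez$ensembl_gene_id[is.na(genes.entrez$entrezgene)]

print(not_returned)
print(returned_without_entrez)
```

Which of the two groups an ID such as ENSG00000018607 falls into says whether the mart did not recognise the Ensembl ID or recognised it without an Entrez cross-reference.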

ADD REPLY • link 6.3 years ago by rin ▴ 40

score 3 · Answer 1 · 2018-08-01

3

6.3 years ago

arta ▴ 670

Try this.

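# biocLite() was retired as of Bioconductor 3.8; BiocManager is the current installer
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("org.Hs.eg.db", "clusterProfiler"))
library(clusterProfiler)  # provides bitr()
library(org.Hs.eg.db)     # human annotation database
# gene.list: character vector of Ensembl gene IDs
gene.df <- bitr(gene.list, fromType = "ENSEMBL",
                toType = c("ENTREZID", "SYMBOL"),
                OrgDb = org.Hs.eg.db)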

ADD COMMENT • link 6.3 years ago by arta ▴ 670

1

Running the code returns this message

'select()' returned 1:many mapping between keys and columns
Warning message:
In bitr(genesens2, fromType = "ENSEMBL", toType = c("ENTREZID",  :
  57.95% of input gene IDs are fail to map...

I know that not all IDs can be mapped, but is such a high percentage normal? In addition, I am still unsure how everything can be included directly in the data frame, as I need to use the expression matrix with the Entrez IDs in later steps. Since not all the IDs will be mapped, I cannot simply add a column.
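One way to cope with both the 1:many mappings and the unmapped IDs is sketched below in base R. The toy tables use invented keys and values, and the column names ENSEMBL and ENTREZID are assumed to follow the fromType/toType names passed to bitr. Keeping the first mapping per Ensembl ID is an arbitrary choice made only for this illustration:

```r
# Toy mapping table with one duplicated key (all values are made up)
gene.df <- data.frame(
  ENSEMBL  = c("ensA", "ensA", "ensB"),
  ENTREZID = c("101", "102", "201"),
  stringsAsFactors = FALSE
)

# Toy expression table; "ensC" has no mapping at all
expr <- data.frame(
  id = c("ensA", "ensB", "ensC"),
  sample_a = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# Ensembl IDs that map to more than one Entrez ID
multi_mapped <- unique(gene.df$ENSEMBL[duplicated(gene.df$ENSEMBL)])

# Keep only the first mapping for each Ensembl ID (arbitrary tie-break)
one_to_one <- gene.df[!duplicated(gene.df$ENSEMBL), ]

# Add the Entrez column by lookup; unmapped rows get NA
expr$ENTREZID <- one_to_one$ENTREZID[match(expr$id, one_to_one$ENSEMBL)]

# Drop rows without an Entrez ID if downstream tools require one
expr_mapped <- expr[!is.na(expr$ENTREZID), ]
print(expr_mapped)
```

Whether dropping unmapped genes or resolving duplicates this way is appropriate depends on the downstream analysis, so the choice is worth recording.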

ADD REPLY • link 6.3 years ago by rin ▴ 40

0

Thanks for your help! Where is the bitr function from? It is not recognised from any of my installed packages.

ADD REPLY • link 6.3 years ago by rin ▴ 40

0

I have updated the code; I forgot to add the other package.

ADD REPLY • link 6.3 years ago by arta ▴ 670

0

Could you elaborate, please? Which package is bitr defined in? Does it deal with the gene builds appropriately?

ADD REPLY • link 6.3 years ago by russhh 5.7k

1

https://www.rdocumentation.org/packages/clusterProfiler/versions/3.0.4/topics/bitr

ADD REPLY • link 6.3 years ago by cpad0112 21k