Question

How To Measure Gene Enrichement (Go, Kegg) Using R And Ensembl Ids ?

7

Entering edit mode

13.4 years ago

Radhouane Aniba ▴ 790

Hi,

What tool do you advice me to use (R package ?) to measure gene set enrichment in GO/KEGG BUT having a list of Ensembl IDS not Entrez IDs

Most of the tools listed are using Entrez Ids and I would like also to know the reason to that ?

Regards

Rad

enrichment pathway identifiers • 19k views

ADD COMMENT • link updated 13.4 years ago by Michael Kuhn 5.0k • written 13.4 years ago by Radhouane Aniba ▴ 790

2

Entering edit mode

Is there any particular reason to use only Ensembl IDs ? I think you if you can use an ID converter to map between Ensembl <-> Entrez you may get more tools to do this type of analysis.

ADD REPLY • link 13.4 years ago by Khader Shameer 18k

0

Entering edit mode

Sometimes mapping Ensembl IDs to Entrez IDs let's say using Biomart makes you loose some entries since this is not a 1-1 relationship, some Ensembl IDs do not have necessary Entrez equivalents

ADD REPLY • link 13.4 years ago by Radhouane Aniba ▴ 790

score 9 · Answer 1 · 2012-02-27

It looks like things have changed in the meantime: Now it is possible to directly work with Ensembl Genes. Note the annot = annFUN.org, mapping="org.Mm.eg.db":

source("http://www.bioconductor.org/biocLite.R")
biocLite(c("biomaRt", "topGO", "org.Mm.eg.db"))
library(biomaRt)
library(org.Mm.eg.db)
library(topGO)    

all_genes <- rep(c(0,1,1), each=10)

names(all_genes) <- c("ENSMUSG00000031167", "ENSMUSG00000022443", "ENSMUSG00000055762", "ENSMUSG00000031328", "ENSMUSG00000026728", "ENSMUSG00000029722", "ENSMUSG00000059208", "ENSMUSG00000017404", "ENSMUSG00000003970", "ENSMUSG00000019210", "ENSMUSG00000024966", "ENSMUSG00000024197", "ENSMUSG00000030847", "ENSMUSG00000004264", "ENSMUSG00000041959", "ENSMUSG00000032643", "ENSMUSG00000042520", "ENSMUSG00000028484", "ENSMUSG00000022186", "ENSMUSG00000027593", "ENSMUSG00000019795", "ENSMUSG00000021134", "ENSMUSG00000026956", "ENSMUSG00000022982", "ENSMUSG00000027162", "ENSMUSG00000024668", "ENSMUSG00000000168", "ENSMUSG00000022234", "ENSMUSG00000024991", "ENSMUSG00000018697")

GOdata <- new("topGOdata", ontology = "BP", allGenes = all_genes, geneSel = function(p) p < 1e-2, description = "Test", annot = annFUN.org, mapping="org.Mm.eg.db", ID="Ensembl")

resultFisher <- runTest(GOdata, algorithm = "classic", statistic = "fisher")
GenTable(GOdata, classicFisher = resultFisher, topNodes = 10)

score 6 · Answer 2 · 2011-06-27

6

Entering edit mode

13.4 years ago

Sean Davis 27k

Try using the biomaRt Bioconductor package to map your Ensembl IDs to Entrez gene. Then you can apply the GOStats package or topGO.

ADD COMMENT • link 13.4 years ago by Sean Davis 27k

0

Entering edit mode

Nice, I have the same comment to Khader, please read it.

ADD REPLY • link 13.4 years ago by Radhouane Aniba ▴ 790

Ram · Answer 3 · 2011-06-28

GO terms are actually mapped to Ensembl IDs on the transcript level. You can get the associated mappings through BioMart by going in with the Ensembl IDs. Mappings are done through the UniProt ID. Ensembl does not map to KEGG, but Reactome is a pathway database that crosslinks to KEGG, and can be queried using Ensembl IDs. http://www.reactome.org/ReactomeGWT/entrypoint.html They also have a Mart: http://www.reactome.org/cgi-bin/mart

I hope this helps.

score 3 · Answer 4 · 2011-11-01

To get the EntrezGene IDs from the bioconductor annotation packages you can do:

library(org.Hs.eg.db)
ENSEMBL <- c('ENSG00000206454', 'ENSG00000111704', 'ENSG00000115232') # example IDs
egIDs <- unlist(mget(ENSEMBL, org.Hs.egENSEMBL2EG, ifnotfound = NA))

Personally I've had some problems with 'unlist' where more than one ENSEMBL ID links to a single Entrez ID or vice versa as illutrated in this example with ENSG00000111704. This maps to two ENTREZ ids and because R cannot handle duplicate names these ENTREZ ids in the results get named ENSG000001117041 and ENSG000001117042. This makes is harder to map back from ENTREZ to ENSEMBL.

Now I prefer:

egIDs <- stack(mget(ENSEMBL, org.Hs.egENSEMBL2EG, ifnotfound = NA))

Or you could use SQL directly on the org.Hs.eg.db database (these packages are SQLite databases; see the AnnotationDBI documentation).

dbCon <- org.Hs.eg_dbconn() # set up the database connection
sql <- 'SELECT * FROM ensembl, genes WHERE genes._id = ensembl._id;' # write your little bit of SQL
result <- dbGetQuery(dbCon, sql) # query the database

ENTREZ <- result[which(result[,2] %in% ENSEMBL), c(2,4)] # subset to get your results

Once you have the mapping you can use the ENTREZ centric tools although as others point out you may lose a few genes in the process. How this will affect the results is unknown though.

Best

duff

score 2 · Answer 5 · 2011-06-27

2

Entering edit mode

13.4 years ago

David Quigley 11k

Warning: anecdotal opinion not supported by hard data.

Why are many tools targeted at Entrez Ids? I personally think Ensembl a very rich environment and a great toolset, but I spend 99% of my time in the NCBI universe, where entrez IDs are the common denominator. Most academic software packages grow out of what someone needed to write for themselves. My SuperUltraGO package would likely use Entrez IDs because I would write against the toolset I personally use. I would need a particular reason to make it eat Ensembl identifiers as well, even though that would be a good idea in theory.

(There is no SuperUltraGO package. Yet. I think.)

ADD COMMENT • link 13.4 years ago by David Quigley 11k

1

Entering edit mode

+1. I like Ensembl too for its rational way of organising genes, transcripts and proteins.

ADD REPLY • link 13.4 years ago by Khader Shameer 18k

1

Entering edit mode

To answer the question with regard to R/Bioconductor, that is historical, largely. The annotation packages are NCBI-centric and it is the organism-specific annotation packages that allow such tasks as GO or KEGG enrichment. It would be possible (and we have talked about doing it) to make annotation packages that are Ensembl or even UCSC centric, but we haven't gone there yet. That said, it IS possible for a user to build an annotation package using AnnotationDbi that would use Ensembl IDs as the keys.

ADD REPLY • link 13.4 years ago by Sean Davis 27k

score 1 · Answer 6 · 2011-07-11

1

Entering edit mode

13.4 years ago

Tommy C ▴ 10

I recommend the Romer or Roast functions for Gene Set Analysis. They fit nicely in with Limma workflows and avoid problems of establishing a suitable cut-off for DE genes.

vb

Tom

ADD COMMENT • link 13.4 years ago by Tommy C ▴ 10

score 1 · Answer 7 · 2011-11-01

1

Entering edit mode

13.1 years ago

Mushal ▴ 10

Hi y you can modify CORNA, http://corna.sourceforge.net/tutorial.html

ADD COMMENT • link 13.1 years ago by Mushal ▴ 10