Gene ID mapping of Ensembl IDs
8.7 years ago

I have a list of IDs that appear to be Ensembl transcript IDs. I want to map these IDs to gene names, but when I use the BioMart view on Ensembl it only gives me the transcript IDs without gene names. The other issue is that the IDs in my data have decimal points, whereas IDs downloaded from Ensembl using martview have none.

Examples of the IDs I have:

ENST00000576171.1   ENSG00000273172.1
ENST00000338094.6   ENSG00000273173.1
ENST00000338327.4   ENSG00000273173.1
ENST00000577949.1   ENSG00000273173.1
ENST00000580062.1   ENSG00000273173.1
RNA-Seq gene ID ensembl • 9.0k views
2.5 years ago

You can pass these IDs directly to gget info, with or without the version number (the number after the decimal point). gget works from the command line or in a Python environment such as JupyterLab.

pip install gget, then simply:

# Command-line
gget info ENST00000576171.1 ENSG00000273172.1
# Python
import gget
gget.info(["ENST00000576171.1", "ENSG00000273172.1"])

If you're gonna plug your tool on all the appropriate questions (which is fine), you might consider making a tool post to advertise it more broadly.


Oh, good look, I didn't see it originally.

8.7 years ago

You likely just need to change the attributes that you want to get from biomart. Here's an example query with your example values that returns the gene symbol and name.
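The query linked in this answer is not reproduced here, so the following is only a rough sketch of what such a request could look like when built in Python. It assumes the human dataset `hsapiens_gene_ensembl`, the filter `ensembl_transcript_id`, and the attributes `ensembl_transcript_id`, `ensembl_gene_id` and `external_gene_name`. The function name `build_biomart_query` is hypothetical. Version suffixes are stripped because this filter is assumed to expect unversioned IDs. The code only builds the request URL and does not send it.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Public Ensembl BioMart endpoint (assumed; mirrors or archives use other hosts).
MARTSERVICE_URL = "https://www.ensembl.org/biomart/martservice"


def build_biomart_query(transcript_ids, dataset="hsapiens_gene_ensembl"):
    """Build a BioMart XML query asking for gene ID and gene name per transcript."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})

    # Drop the ".N" version suffix before filtering on the transcript ID.
    unversioned = [t.split(".")[0] for t in transcript_ids]
    ET.SubElement(ds, "Filter", {
        "name": "ensembl_transcript_id",
        "value": ",".join(unversioned),
    })

    # The attributes decide which columns come back; the gene name is one of them.
    for attr in ("ensembl_transcript_id", "ensembl_gene_id", "external_gene_name"):
        ET.SubElement(ds, "Attribute", {"name": attr})

    return ET.tostring(query, encoding="unicode")


xml_query = build_biomart_query(["ENST00000576171.1", "ENST00000338094.6"])
request_url = MARTSERVICE_URL + "?" + urlencode({"query": xml_query})
print(request_url)  # open in a browser or fetch with an HTTP client of your choice
```

Swapping or adding `Attribute` entries is the scripted equivalent of ticking different boxes in the Attributes panel of the BioMart web interface.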


I figured that out after some digging, but thank you for the insight. The other issue is with the decimal points in the IDs. I see these in Ensembl, but the BioMart browser does not appear to be able to provide Ensembl IDs with the decimal point. If I am just trying to add gene names, would it be best to remove the decimal points and assign the names using that list?


I think the decimal parts of the ENSGs can be dropped. I have done so in the past and have still been able to convert. One thing to note is that some ENSGs (depending on which reference you are using) are old and have been retired. As such, they won't map to anything.

Here is some R code I often use (not mine originally; I can't remember where I found it):
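For anyone doing this step in Python, dropping the decimal part can be sketched as below. This assumes the IDs have the form shown in the question (a stable ID optionally followed by `.N`). The helper name `strip_version` is made up for illustration.

```python
import re

# Matches a trailing ".<digits>" version suffix, e.g. the ".1" in ENSG00000273172.1
_VERSION_SUFFIX = re.compile(r"\.\d+$")


def strip_version(ensembl_id: str) -> str:
    """Return the Ensembl ID without its trailing version suffix, if any."""
    return _VERSION_SUFFIX.sub("", ensembl_id.strip())


ids = ["ENST00000576171.1", "ENSG00000273172.1", "ENSG00000273173"]
unversioned = [strip_version(i) for i in ids]
print(unversioned)
```

IDs that already lack a suffix pass through unchanged, so it is safe to run on a mixed list.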

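# Map IDs from one key type to another using an AnnotationDb object.
# Returns a vector the same length as `ids`, with NA where no mapping exists.
convertIDs <- function( ids, from, to, db, ifMultiple=c("putNA", "useFirst")) {
  stopifnot( inherits( db, "AnnotationDb" ) )
  ifMultiple <- match.arg( ifMultiple )
  # select() warns about 1:many mappings; those are handled below
  suppressWarnings( selRes <- AnnotationDbi::select(
    db, keys=ids, keytype=from, columns=c(from,to) ) )

  # "putNA": drop IDs with more than one match so they come back as NA
  if ( ifMultiple == "putNA" ) {
    duplicatedIds <- selRes[ duplicated( selRes[,1] ), 1 ]
    selRes <- selRes[ ! selRes[,1] %in% duplicatedIds, ]
  }

  # match() picks the first hit per ID, which implements "useFirst"
  return( selRes[ match( ids, selRes[,1] ), 2 ] )
}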

It requires library(org.Hs.eg.db) to work for human genes. Here is an example call:

results$entrez <- convertIDs(my_list_of_ensgs, "ENSEMBL", "ENTREZID", org.Hs.eg.db)
results$symbol <- convertIDs(my_list_of_ensgs, "ENSEMBL", "SYMBOL", org.Hs.eg.db)

Do note that it will return NAs for genes with no mapping (to maintain the size of the list). You can either filter these out or look them up manually.
