Question

Error while converting Gene ID to Ensembl IDs

3.0 years ago

kai_bio ▴ 50

I have a data frame of DEGs with gene IDs (gene symbols).

[Image: DEGs List]

I am trying to convert the Gene_IDs into Ensembl IDs. I have tried the following methods:

library("AnnotationDbi")
library("org.Hs.eg.db")
res3$ensid <- mapIds(org.Hs.eg.db,
                     keys = res3$Gene_ID,
                     column = "ENSEMBL",
                     keytype = "SYMBOL",
                     multiVals = "first")

The code above converted most of the gene IDs but returned NA for a few of them. Can someone please shed some light on why this happens?

I also tried the biomaRt package:

library("biomaRt")
listMarts()
ensembl <- useMart("ensembl")
datasets <- listDatasets(ensembl)
ensembl = useDataset("hsapiens_gene_ensembl", mart = ensembl)
options(max.print = 1000000)
res3$ensid <- getBM(attributes = c('external_gene_name','ensembl_gene_id'), filters = 'external_gene_name',
               values = res3$Gene_ID, mart = ensembl, uniqueRows = FALSE)

but it gives the following error:

Error in `$<-.data.frame`(`*tmp*`, ensid, value = list(external_gene_name = c("KRT23",  : 
  replacement has 16202 rows, data has 17281

which suggests that my data has more rows with gene IDs (17281) than the result returned by getBM (16202). Can someone please guide me? Thank you!

ensembl R biomart RNA-Seq • 2.5k views

updated 3.0 years ago by jv ★ 1.8k • written 3.0 years ago by kai_bio ▴ 50

Incomplete mappings between different annotation systems are to be expected. Each annotation system has its own rules about what to annotate, i.e., what to include. Gene symbols are particularly troublesome. Look at some of the genes that fail to map and investigate further: they may be listed under an alias rather than the official gene symbol, or they may correspond to an obscure predicted gene or non-coding RNA that has not been validated.
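
To look at the genes that did not map, one could pull out the rows where the new column is NA. The following is a minimal sketch on a toy stand-in for `res3`; the symbols and IDs in it are made up for illustration and are not real annotations.

```r
# Toy stand-in for res3 after the mapIds() call; all values are made up.
res3 <- data.frame(
  Gene_ID = c("GENE_A", "GENE_B", "OLD_ALIAS_C", "GENE_D"),
  ensid   = c("ENSG_TOY_1", "ENSG_TOY_2", NA, NA),
  stringsAsFactors = FALSE
)

# Symbols for which no Ensembl ID was returned
unmapped <- res3$Gene_ID[is.na(res3$ensid)]

# How many failed, and what fraction of the table that is
n_unmapped    <- length(unmapped)
frac_unmapped <- mean(is.na(res3$ensid))

print(unmapped)
```

On real data, the `unmapped` vector is the list to inspect by hand. One possible follow-up, assuming the failures are aliases, is to retry only those symbols with `keytype = "ALIAS"` in `mapIds()`.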

3.0 years ago by Kevin Blighe 88k

Thanks for the explanation

3.0 years ago by kai_bio ▴ 50

As Kevin said, there won't always be a perfect mapping between different ID types. One option is to left-join the biomaRt result onto your data frame, which leaves NA for the genes that have no match.

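library("dplyr")

# Assumes `ensembl` is the mart object created in the question
ids <- getBM(
  attributes = c('external_gene_name', 'ensembl_gene_id'),
  filters = 'external_gene_name',
  values = res3$Gene_ID, mart = ensembl, uniqueRows = FALSE)

# Keep every row of res3; genes without a match get NA.
# A symbol with several Ensembl IDs yields several rows.
res3 <- left_join(res3, ids, by = c("Gene_ID" = "external_gene_name"))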
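
The original error arises because getBM() returns a data frame containing only the symbols it found, so it cannot be assigned directly as a column of a longer data frame. If exactly one Ensembl ID per row is wanted, a base R alternative to the join is to look up each symbol with `match()`. This is only a sketch on toy data: the `ids` table mimics the shape of a getBM() result, and every symbol and ID in it is made up.

```r
# Toy stand-in for the DEG table; symbols are made up.
res3 <- data.frame(
  Gene_ID = c("GENE_A", "GENE_B", "GENE_C", "GENE_D"),
  stringsAsFactors = FALSE
)

# Toy stand-in for a getBM() result: fewer symbols than res3,
# a different order, and one symbol with two IDs.
ids <- data.frame(
  external_gene_name = c("GENE_D", "GENE_A", "GENE_A"),
  ensembl_gene_id    = c("ENSG_TOY_4", "ENSG_TOY_1", "ENSG_TOY_9"),
  stringsAsFactors = FALSE
)

# Keep only the first ID listed for each symbol
ids_first <- ids[!duplicated(ids$external_gene_name), ]

# For each Gene_ID, find its row in ids_first (NA if absent),
# then pull the Ensembl ID from that row
idx <- match(res3$Gene_ID, ids_first$external_gene_name)
res3$ensid <- ids_first$ensembl_gene_id[idx]

print(res3)
```

Keeping the first ID is an arbitrary choice, comparable to `multiVals = "first"` in `mapIds()`. The join above keeps all IDs instead, at the cost of duplicated rows.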

3.0 years ago by rpolicastro 13k

Thank you! It's working, and I was able to join the Ensembl IDs to my gene IDs.

3.0 years ago by kai_bio ▴ 50

As others stated, not all gene symbols will have an ID in another database. Sometimes this is due to changes in gene annotations, for instance if your gene symbols come from an older genome assembly while org.Hs.eg.db, which is updated regularly, reflects the current one (an issue I have encountered before).

One additional option for pulling more information about your genes, such as possible alias symbols, is to query the HGNC REST API: https://www.genenames.org/help/rest/

3.0 years ago by jv ★ 1.8k