3.0 years ago
kai_bio
I have a data frame of DEGs with gene IDs in a Gene_ID column.
I am trying to convert the Gene_IDs into Ensembl IDs. I have tried the following methods:
library("AnnotationDbi")
library("org.Hs.eg.db")
res3$ensid = mapIds(org.Hs.eg.db,
                    keys = res3$Gene_ID,
                    column = "ENSEMBL",
                    keytype = "SYMBOL",
                    multiVals = "first")
The above code converted most of the gene IDs but gave NA values for a couple of them. Can someone please shed some light on this as I can't understand why?
I also tried the biomaRt package:
library("biomaRt")
listMarts()
ensembl <- useMart("ensembl")
datasets <- listDatasets(ensembl)
ensembl = useDataset("hsapiens_gene_ensembl", mart = ensembl)
options(max.print = 1000000)
res3$ensid <- getBM(attributes = c('external_gene_name', 'ensembl_gene_id'),
                    filters = 'external_gene_name',
                    values = res3$Gene_ID,
                    mart = ensembl,
                    uniqueRows = FALSE)
but it gives the following error:
Error in `$<-.data.frame`(`*tmp*`, ensid, value = list(external_gene_name = c("KRT23", :
replacement has 16202 rows, data has 17281
which shows that my data has more rows (17281) than the biomaRt result (16202). Can someone please guide me? Thank you!
It is to be expected that mappings between different annotation systems will be incomplete. Each annotation system has its own rules about what to annotate, i.e., what to include. Gene symbols in particular tend to cause difficulty. You could look at some of the genes that are not mapping and investigate further: they may be using an alias that is not the official gene symbol, or they may relate to some obscure predicted gene or non-coding RNA that has not even been validated.
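To see which symbols failed, the NA entries can be pulled out directly. A minimal sketch, assuming `res3$ensid` already holds the `mapIds()` result; the toy table below (symbols and placeholder IDs) is made up so that the snippet runs on its own:

```r
# Toy stand-in for the annotated DEG table (all values are illustrative placeholders)
res3 <- data.frame(
  Gene_ID = c("KRT23", "OLD_ALIAS_1", "TP53", "LOC_EXAMPLE"),
  ensid   = c("ENSG_PLACEHOLDER_1", NA, "ENSG_PLACEHOLDER_2", NA),
  stringsAsFactors = FALSE
)

# Symbols for which no Ensembl ID was returned
unmapped <- res3$Gene_ID[is.na(res3$ensid)]

# How many, and what fraction of the table
n_unmapped    <- length(unmapped)
frac_unmapped <- mean(is.na(res3$ensid))

print(unmapped)
```

The symbols in `unmapped` can then be checked by hand, or retried through `mapIds()` with `keytype = "ALIAS"`, keeping in mind that an alias can point to more than one gene.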
Thanks for the explanation
As Kevin said, there won't always be a perfect mapping between different ID types. One option is to join the two datasets and fill in NA for those that are missing.
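A minimal sketch of such a join in base R. Here `res3` is a small made-up table and `bm` is a made-up stand-in for the data frame returned by `getBM()`; the Ensembl IDs are placeholders, one symbol is left unmapped on purpose, and one is mapped twice. `merge(..., all.x = TRUE)` keeps every row of `res3` and fills in NA where there is no match:

```r
# Toy stand-in for the DEG table (values are made up for illustration)
res3 <- data.frame(
  Gene_ID        = c("KRT23", "TP53", "NOT_A_GENE"),
  log2FoldChange = c(2.1, -1.3, 0.7),
  stringsAsFactors = FALSE
)

# Toy stand-in for the getBM() result: unmatched symbols are simply absent,
# and a symbol may appear on more than one row
bm <- data.frame(
  external_gene_name = c("KRT23", "TP53", "TP53"),
  ensembl_gene_id    = c("ENSG_PLACEHOLDER_1", "ENSG_PLACEHOLDER_2", "ENSG_PLACEHOLDER_3"),
  stringsAsFactors = FALSE
)

# Keep one Ensembl ID per symbol (the first hit), similar in spirit to multiVals = "first"
bm_unique <- bm[!duplicated(bm$external_gene_name), ]

# Left join: every row of res3 is kept; symbols without a match get NA
res3_annot <- merge(res3, bm_unique,
                    by.x = "Gene_ID", by.y = "external_gene_name",
                    all.x = TRUE, sort = FALSE)

print(res3_annot)
```

Storing the `getBM()` output in its own data frame and merging it, rather than assigning it straight to `res3$ensid`, avoids the "replacement has ... rows" error, because the two tables no longer need to have the same number of rows.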
Thank you! It's working, and I was able to join the Ensembl IDs to my existing gene IDs.
As others stated, not every gene symbol will have an ID in another database. Sometimes this is due to changes in gene annotations, for instance if you are using gene symbols from an older genome assembly versus what is in org.Hs.eg.db, which is updated regularly (an issue I have encountered before).
One additional option for pulling more info for your genes, like possible alias symbols, is to query the HGNC Rest API https://www.genenames.org/help/rest/
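A rough sketch of how such a query could be built. The helper `hgnc_fetch_url()` is hypothetical, and the `fetch/<field>/<value>` URL layout and the field names are my reading of the help page linked above, so please check them there before relying on this. The request itself is left commented out since it needs network access and the httr and jsonlite packages:

```r
# Hypothetical helper: build an HGNC REST "fetch" URL for one symbol.
# `field` is assumed to be one of the searchable fields listed in the HGNC REST help,
# e.g. "symbol", "alias_symbol" or "prev_symbol".
hgnc_fetch_url <- function(symbol, field = "alias_symbol") {
  paste0("https://rest.genenames.org/fetch/", field, "/",
         utils::URLencode(symbol, reserved = TRUE))
}

url <- hgnc_fetch_url("KRT23", field = "symbol")
print(url)

# Not run here (needs network access plus httr and jsonlite):
# resp <- httr::GET(url, httr::add_headers(Accept = "application/json"))
# hits <- jsonlite::fromJSON(httr::content(resp, as = "text", encoding = "UTF-8"))
# hits$response$docs   # assumed layout of the JSON reply
```

Looping this over the symbols that came back as NA, with `field = "alias_symbol"` or `field = "prev_symbol"`, may recover the current approved symbol for genes that were renamed.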