I have a list of microarray gene accession numbers and would like to obtain the gene symbols. Any recommendations on how to go about it?
If the species of interest is in Ensembl, try Ensembl's BioMart. See how to use it here. ID conversion is one of the most common bioinformatics tasks, so you should consider learning to do it programmatically if you're going to be doing this more than a couple of times.
I haven't done any analyses yet; this is my first task. I notice, though, that my data has GenBank accession numbers, while Ensembl doesn't have these. Might you know how to perform such tasks in R, and what sort of packages I need for it? Any other links to helpful materials would also be appreciated. Thanks!
For R, the Bioconductor org.XXX.eg.db packages contain objects mapping between different types of identifiers for the XXX organisms. Make sure you note which version you're using, as these packages are updated regularly. Use it like this (example for human):
library("org.Hs.eg.db")
gene.symbols <- mapIds(org.Hs.eg.db, keys = list.of.IDs, keytype = "ENTREZID", column="SYMBOL")
Alternatively, use the biomaRt package. Read the vignette to see how to use it.
Hi Jean-Karim, I used the biomaRt package for mapping. I had two datasets, one for Homo sapiens and the other for mouse. However, I was left with a large amount of unmapped data, so I thought I should use the AnnotationHub Bioconductor package to map some of those. I used the call you provided, but I get an error. How do I call the function correctly? This was my input (acc.not.mapped is a vector of length 10886 containing the accession numbers that were not mapped):
head(acc.not.mapped)
[1] "AY766452" "XR_109632" "AK130765" "NM_020914" "NM_001077493" "AY358259"
acc.not.mapped %>% as.data.frame
gene.symbols <- mapIds(org.Hs.eg.db, keys = acc.not.mapped, keytype = "ENTREZID", column="SYMBOL")
"Error in .testForValidKeys(x, keys, keytype, fks) : None of the keys entered are valid keys for 'ENTREZID'. Please use the keys method to see a listing of valid arguments."
$ esearch -db nuccore -query AK130765 | elink -target gene | efetch -format docsum | xtract -pattern DocumentSummary -element Name
LOC105378085
$ esearch -db nuccore -query AY766452 | elink -target gene | efetch -format docsum | xtract -pattern DocumentSummary -element Name
CCL4L2
$ esearch -db nuccore -query NM_001077493 | elink -target gene | efetch -format docsum | xtract -pattern DocumentSummary -element Name
QueryKey value not found in summary input
You can use EntrezDirect:
$ more id
NM_010378
NM_010382
NM_008873
NM_016701
NM_178057
NM_028072
NM_020581
NM_010441
NM_207105
$ for i in $(cat id); do printf ${i}"\t"; esearch -db nuccore -query ${i} | elink -target gene | efetch -format docsum | xtract -pattern DocumentSummary -element Name; done
NM_010378 H2-Aa
NM_010382 H2-Eb1
NM_008873 Plau
NM_016701 Nes
NM_178057 QueryKey value not found in summary input
NM_028072 Sulf2
NM_020581 Angptl4
NM_010441 Hmga2
NM_207105 H2-Ab1
Thanks! I would like to try this out too, to enrich my experience. I am a beginner with programming, so I only understand basic loops. For this particular one, could you please explain the meaning of this part of the command: "\t"; esearch -db nuccore -query ${i} | elink
Each entry read from the file called id is passed to EntrezDirect for the search, specifically to the esearch program that is part of that package. Since EntrezDirect does not keep track of the original search terms, I print each one out with printf so you know which term belongs to the gene name that is looked up.
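The loop described above can also be written a little more defensively. This is only a sketch, assuming EDirect (esearch, elink, efetch, xtract) is installed and on your PATH; lookup_symbol and annotate_ids are hypothetical helper names, not part of EDirect. It reads the file line by line, quotes each accession, and prints NA when no symbol comes back. Standard input is redirected from /dev/null for esearch because EDirect tools can read standard input and may otherwise swallow the rest of the ID list inside a while-read loop.

```shell
#!/usr/bin/env bash
# Sketch: print "accession<TAB>gene symbol" for accessions listed one per line.
# Assumes EDirect is installed; the function names are illustrative only.

# lookup_symbol ACCESSION: the same pipeline used in the thread, wrapped in a function.
lookup_symbol() {
    # < /dev/null keeps esearch from reading the loop's input file
    esearch -db nuccore -query "$1" < /dev/null |
        elink -target gene |
        efetch -format docsum |
        xtract -pattern DocumentSummary -element Name
}

# annotate_ids FILE: one output line per accession; NA when nothing is returned.
annotate_ids() {
    local acc symbol
    # "|| [ -n "$acc" ]" also handles a last line without a trailing newline
    while IFS= read -r acc || [ -n "$acc" ]; do
        [ -z "$acc" ] && continue                 # skip blank lines
        # keep only the first symbol; messages written to stderr are discarded
        symbol=$(lookup_symbol "$acc" 2>/dev/null | head -n 1)
        printf '%s\t%s\n' "$acc" "${symbol:-NA}"
    done < "$1"
}
```

With the id file from the example, this would be called as `annotate_ids id > symbols.tsv`.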
If you want an easy option, you can use the batch search on this MGI Informatics page. Paste your IDs in, or upload a file with them, and hit search.
Don't SHOUT please.
ok, sorry about that.
Just as a side note, getting the symbols should be one of the later steps in the analysis; for anything else they are redundant.
When referring to any kind of IDs, please provide examples.
I am in the preliminary steps of conducting a meta-analysis of microarray data; this is my first time analysing any data. I need the gene IDs to feed them into one of the databases for pathway and enrichment analysis. One dataset has the following GenBank accession numbers: NM_010378, NM_010382, NM_008873, NM_016701, NM_178057, NM_028072, NM_020581, NM_010441, NM_207105, NM_031254, NM_007631, XM_001005899, NM_029422, NM_147217, XM_993267, NM_175406, XM_985034, etc.
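A pasted, comma-separated list like the one above can be turned into a file with one accession per line before any lookup. A minimal sketch, assuming the list has been saved to a hypothetical file raw_ids.txt; clean_ids is an illustrative name:

```shell
# clean_ids FILE: split on commas, spaces and tabs, drop empty fields,
# and remove duplicates while keeping the original order.
clean_ids() {
    tr ', \t' '\n\n\n' < "$1" | awk 'NF && !seen[$0]++'
}
```

For example, `clean_ids raw_ids.txt > id` would produce the kind of id file used in the EntrezDirect loop.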