How do you get gene symbols from microarray accession numbers?
4.6 years ago
dikisakye • 0

I have a list of microarray gene accession numbers and would like to obtain the gene symbols. Any recommendations on how to go about it?

gene • 1.7k views

Don't SHOUT please.

ok, sorry about that.

Just as a side note, getting the symbols should be one of the later steps in the analysis; for anything before that they are redundant.

When referring to any kind of IDs, please provide examples.

I am in the preliminary steps of conducting a meta-analysis of microarray data; this is my first time analysing any data. I need the gene IDs to feed into one of the databases for pathway and enrichment analysis. One dataset has the following GenBank accession numbers: NM_010378, NM_010382, NM_008873, NM_016701, NM_178057, NM_028072, NM_020581, NM_010441, NM_207105, NM_031254, NM_007631, XM_001005899, NM_029422, NM_147217, XM_993267, NM_175406, XM_985034, etc.

4.6 years ago

If the species of interest is in Ensembl, try Ensembl's BioMart. See how to use it here. ID conversion is one of the most common bioinformatics tasks, so you should consider learning to do it programmatically if you're going to do this more than a couple of times.

I haven't done any analyses yet; this is my first task. I notice, though, that my data has GenBank accession numbers, while Ensembl doesn't have these. Might you know how to perform such tasks in R, and what packages I need for it? Any other links to helpful materials would also be welcome. Thanks!

For R, the Bioconductor org.XXX.eg.db packages contain objects mapping between different types of identifiers for the XXX organisms. Make a note of which version you're using, as these packages are updated regularly. Use it like this (example for human, with Entrez gene IDs as the input keys):

library("org.Hs.eg.db")
gene.symbols <- mapIds(org.Hs.eg.db, keys = list.of.IDs, keytype = "ENTREZID", column="SYMBOL")

Alternatively, use the biomaRt package. Read the vignette to see how to use it.

Thanks! Will give feedback on progress

Hi Jean-Karim, I used the biomaRt package for mapping. I had two datasets, one for Homo sapiens and the other for mouse. However, I was left with a large amount of unmapped data, so I thought I should use the AnnotationHub Bioconductor package to map some of those. I used the call you provided, but I get an error. How do I run the function correctly? This was my input (acc.not.mapped is a vector of length 10886 containing the accession numbers that were not mapped):

head(acc.not.mapped)
[1] "AY766452"     "XR_109632"    "AK130765"     "NM_020914"    "NM_001077493" "AY358259" 
acc.not.mapped  %>% as.data.frame
gene.symbols <- mapIds(org.Hs.eg.db, keys = acc.not.mapped, keytype = "ENTREZID", column="SYMBOL")
"Error in .testForValidKeys(x, keys, keytype, fks) : None of the keys entered are valid keys for 'ENTREZID'. Please use the keys method to see a listing of valid arguments."

Thanks for the modification genomax, how do I rectify the issue?

ADD REPLY
0
Entering edit mode
$ esearch -db nuccore -query AK130765 | elink -target gene | efetch -format docsum | xtract -pattern DocumentSummary -element Name
LOC105378085
$ esearch -db nuccore -query AY766452 | elink -target gene | efetch -format docsum | xtract -pattern DocumentSummary -element Name
CCL4L2
$ esearch -db nuccore -query NM_001077493 | elink -target gene | efetch -format docsum | xtract -pattern DocumentSummary -element Name
QueryKey value not found in summary input

Thanks for the help genomax

4.6 years ago
GenoMax 148k

You can use EntrezDirect:

$ more id
NM_010378
NM_010382
NM_008873
NM_016701
NM_178057
NM_028072
NM_020581
NM_010441
NM_207105

$ for i in $(cat id); do printf ${i}"\t"; esearch -db nuccore -query ${i} | elink -target gene | efetch -format docsum | xtract -pattern DocumentSummary -element Name; done
NM_010378       H2-Aa
NM_010382       H2-Eb1
NM_008873       Plau
NM_016701       Nes
NM_178057       QueryKey value not found in summary input
NM_028072       Sulf2
NM_020581       Angptl4
NM_010441       Hmga2
NM_207105       H2-Ab1

Thanks! I would like to try this out too, to broaden my experience. I am a beginner with programming, so I only understand basic loops. For this particular one, could you please explain what this part of the command means: "\t"; esearch -db nuccore -query ${i} | elink

Each entry read from the file called id is passed to EntrezDirect for the search, specifically to the esearch program that is part of that package. Since EntrezDirect does not keep track of the original search terms, I print each one with printf so you know which accession each looked-up gene name belongs to.

If you want an easy option, you can use the batch search on this MGI Informatics page. Paste your IDs in (or upload a file with them) and hit search.
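
For anyone adapting the loop explained above, here is a minimal sketch of the same idea wrapped in two shell functions. It assumes EDirect is installed and on your PATH. The function names lookup_symbol and annotate_ids are made up for this illustration and are not part of EDirect. Redirecting the stdin of esearch from /dev/null is only a precaution, so that it cannot consume the rest of the ID list inside the while loop.

```shell
# Hypothetical helper: look up the gene name for one nucleotide accession.
# Uses the same EDirect pipeline as the one-liner in this thread.
lookup_symbol() {
    esearch -db nuccore -query "$1" < /dev/null |
        elink -target gene |
        efetch -format docsum |
        xtract -pattern DocumentSummary -element Name
}

# Hypothetical helper: read accessions (one per line) from the file given
# as $1 and print "accession<TAB>gene name" for each of them.
# The body runs in a subshell, so its variables do not leak out.
annotate_ids() (
    while IFS= read -r acc; do
        [ -n "$acc" ] || continue      # skip blank lines
        # Keep only the first line of the lookup's standard output.
        symbol=$(lookup_symbol "$acc" 2>/dev/null | head -n 1)
        # Print NA when the lookup wrote nothing to standard output.
        printf '%s\t%s\n' "$acc" "${symbol:-NA}"
    done < "$1"
)
```

With the id file from the answer above, this would be called as annotate_ids id. Printing the accession and the result with a single printf keeps each pair on one line even when a lookup returns nothing.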
