Question

Getting Gene Names From Genbank Ids

1

13.4 years ago

Dominik ▴ 10

Hi,

I am trying to get gene names from identifiers, e.g. from the nt database, so let's say GenBank entries. I am working with Java and originally did not want to touch BioPerl to do so.

BioMart does not seem to support GenBank IDs yet.

So what is the best and easiest way to look up gene names by identifier online? Is there any web service I could use? I just ran into the E-utilities from NCBI, which could do the job, I guess. Or shall I integrate Perl into my Java software? The easiest and most straightforward solution is welcome ;-)

genbank gene • 15k views

ADD COMMENT • link updated 13.4 years ago by Babak Memari • 0 • written 13.4 years ago by Dominik ▴ 10

1

Perhaps give an example of a Genbank identifier and the associated gene name. These terms can be ambiguous. For example: Genbank accession NM_018689 has GI 38638697 which maps to HGNC symbol KIAA1199 - the latter would be your "gene name"?

ADD REPLY • link 13.4 years ago by Neilfws 49k

0

Yes, exactly. When using identifier NM_018689 you should get the gene name KIAA1199, maybe also the synonyms CCSP1 and TMEM2L. But how to easily access these data?

ADD REPLY • link 13.4 years ago by Dominik ▴ 10

0

OK; so "BioMart does not seem to support GenBank IDs" is incorrect. Accessions such as NM_* are Refseq mRNA IDs which work fine, as do other kinds of NCBI accession.

ADD REPLY • link 13.4 years ago by Neilfws 49k

score 3 · Answer 1 · 2012-02-28

3

13.4 years ago

Neilfws 49k

The answer, as with almost every "map identifier X to identifier Y" problem, is BioMart.

If you search this site for "biomart" or look at the list of related questions on the right, you'll find answers which give step-by-step instructions. Basically, you need to select a database and a dataset for your organism (human, I assume). Next, you specify Refseq mRNA IDs as your filter (assuming your GenBank entries of interest have accessions beginning with NM_) and supply a list of IDs. Finally, you choose "HGNC symbol" as your attribute and click "Results".

The nice people at the Ensembl help desk have made video tutorials for BioMart and put them on YouTube.

There's also programmatic access to BioMart using e.g. Perl or R/Bioconductor.
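
For programmatic access, the filter and attribute choices described above can also be expressed as a BioMart XML query. The following Python snippet is only a minimal sketch: the martservice URL, the dataset name `hsapiens_gene_ensembl`, the filter name `refseq_mrna`, the attribute name `hgnc_symbol` and both function names are assumptions that are not given in this answer and should be checked against the BioMart instance you use.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Assumption: Ensembl's BioMart REST endpoint; other marts use other URLs.
MARTSERVICE = "https://www.ensembl.org/biomart/martservice"


def build_biomart_query(refseq_ids):
    """Build a BioMart XML query mapping RefSeq mRNA IDs to HGNC symbols."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "0",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    # Assumption: the human gene dataset; this is the "select a dataset" step.
    dataset = ET.SubElement(query, "Dataset",
                            {"name": "hsapiens_gene_ensembl",
                             "interface": "default"})
    # Filter: the list of RefSeq mRNA accessions (NM_*), comma-separated.
    ET.SubElement(dataset, "Filter",
                  {"name": "refseq_mrna", "value": ",".join(refseq_ids)})
    # Attributes: the columns to return for each hit.
    ET.SubElement(dataset, "Attribute", {"name": "refseq_mrna"})
    ET.SubElement(dataset, "Attribute", {"name": "hgnc_symbol"})
    return ET.tostring(query, encoding="unicode")


def build_biomart_url(refseq_ids):
    """Return a GET URL that submits the query to the martservice."""
    params = {"query": build_biomart_query(refseq_ids)}
    return MARTSERVICE + "?" + urllib.parse.urlencode(params)
```

Opening the URL from `build_biomart_url(["NM_018689"])` with `urllib.request.urlopen` should return tab-separated lines of accession and symbol, provided the assumed names are valid for that mart.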

ADD COMMENT • link 13.4 years ago by Neilfws 49k

0

Thanks for your answer, neilfws! Well, that's exactly the point. I cannot choose a dataset since I do not know the organism; that's why I am searching for an ID mapper for the nt database that gets me the gene name if possible. And my hits do not all start with NM_.

I am just searching for an online programmatic solution that in principle does the same job as the NCBI nucleotide web page: take an ID, get the GenBank entry and parse it to get the gene name if possible.
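
The "take an ID, get the GenBank entry, parse it" idea could be sketched in Python roughly as follows. This is an illustration under assumptions, not a tested pipeline: it assumes the NCBI EFetch endpoint with `db=nuccore`, `rettype=gb` and `retmode=text`, and that gene names appear as `/gene="..."` qualifiers in the feature table. The function names are hypothetical.

```python
import re
import urllib.parse

# Assumption: current HTTPS base URL of the NCBI E-utilities.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Matches feature-table lines such as:   /gene="KIAA1199"
# (it does not match /gene_synonym="..." because "=" must follow "gene").
GENE_QUALIFIER = re.compile(r'^\s+/gene="([^"]+)"', re.MULTILINE)


def build_genbank_url(accession):
    """Build an EFetch URL that returns one entry as a GenBank flat file."""
    params = {"db": "nuccore", "id": accession,
              "rettype": "gb", "retmode": "text"}
    return f"{EUTILS}/efetch.fcgi?{urllib.parse.urlencode(params)}"


def gene_names_from_flatfile(text):
    """Return distinct /gene qualifier values in order of first appearance."""
    names = []
    for name in GENE_QUALIFIER.findall(text):
        if name not in names:
            names.append(name)
    return names
```

The flat file itself would come from something like `urllib.request.urlopen(build_genbank_url(acc)).read().decode()`. Entries without an annotated gene feature yield an empty list, which matches the "if possible" caveat.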

ADD REPLY • link 13.4 years ago by Dominik ▴ 10

0

Aha, I see. "Arbitrary" nt entries, as opposed to a specific organism. Let me think about that.

ADD REPLY • link 13.4 years ago by Neilfws 49k

0

I think we covered it in the answer above. Just to point out: your question implied that BioMart was a potential solution but you thought the problem with it was the IDs. There was no mention of the "no organism" problem.

ADD REPLY • link 13.4 years ago by Neilfws 49k

Neilfws · Answer 2 · 2012-02-28

I suppose the best way to do it is with BioMart, as already suggested. Anyway, you can also use the E-utilities, building the pipeline ESearch -> EFetch. To use ESearch with a GenBank ID you need to compose the following string:

"http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gene&term="+genbank_id+"[Nucleotide+Accession]"

Here the database is set to gene and the query is a GenBank accession number. For more than one ID, loop over the values of genbank_id. Once you have the string, you can retrieve the UID by parsing the XML that ESearch returns. I've been using Python and I can't tell you how to do it in Java, but there's surely an easy way. Anyway, in Python:

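import urllib.request
import xml.dom.minidom

def _getText(node):
    # Each <Id> element holds a single text node containing the Gene UID.
    assert node.firstChild.nodeType == node.TEXT_NODE
    return node.firstChild.data

# query is the ESearch URL string composed above
doc = xml.dom.minidom.parse(urllib.request.urlopen(query))
idlist = [_getText(node) for node in doc.getElementsByTagName('Id')]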
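
For reference, the same ESearch step can be split into a URL builder and a parser, which makes it testable without a network call. This is a minimal sketch: the function names and the `https://eutils.ncbi.nlm.nih.gov` base URL are assumptions that are not part of this answer, and the search field is simply the one used in the string above.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: current HTTPS base URL of the NCBI E-utilities.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(accession):
    """Build an ESearch URL that looks up Gene UIDs for one accession."""
    params = {"db": "gene", "term": f"{accession}[Nucleotide Accession]"}
    # urlencode escapes the brackets and turns the space into "+".
    return f"{EUTILS}/esearch.fcgi?{urllib.parse.urlencode(params)}"


def parse_id_list(xml_text):
    """Return the text of every <Id> element in an ESearch XML response."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter("Id")]


def esearch_gene_ids(accession):
    """Network call: run ESearch for one accession and return the UIDs."""
    with urllib.request.urlopen(build_esearch_url(accession)) as response:
        return parse_id_list(response.read())
```

When looping over many accessions, it is worth pausing between requests, since NCBI limits how often the E-utilities may be called.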

These IDs can be used as input for EFetch:

"http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id="+idlist[i]

and finally you can manipulate the data record output to get the information required (just with simple string-matching).
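
As an alternative to string matching, the record can be requested as XML and read with a parser. The sketch below rests on assumptions that are not stated in this answer: that EFetch accepts `retmode=xml` for `db=gene`, and that the symbol and synonyms sit in `Gene-ref_locus` and `Gene-ref_syn_E` elements of the returned record. Both function names are hypothetical, so check the element names against a real response before relying on them.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Assumption: current HTTPS base URL of the NCBI E-utilities.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_efetch_url(gene_uid):
    """Build an EFetch URL that requests one Gene record as XML."""
    params = {"db": "gene", "id": gene_uid, "retmode": "xml"}
    return f"{EUTILS}/efetch.fcgi?{urllib.parse.urlencode(params)}"


def parse_gene_symbols(xml_text):
    """Return (symbol, synonyms) from a Gene XML record.

    symbol is None when no Gene-ref_locus element is present.
    """
    root = ET.fromstring(xml_text)
    # Assumed element names; take the first locus element found.
    locus = next(root.iter("Gene-ref_locus"), None)
    symbol = locus.text if locus is not None else None
    synonyms = [element.text for element in root.iter("Gene-ref_syn_E")]
    return symbol, synonyms
```

The XML text would come from opening `build_efetch_url(idlist[i])` with `urllib.request.urlopen`, one UID at a time.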

score 0 · Answer 3 · 2012-03-02

0

13.4 years ago

Babak Memari • 0

I suggest using this site. The HUGO Gene Nomenclature Committee (HGNC) has assigned unique gene symbols and names to over 33,000 human loci, of which around 19,000 are protein coding.

link text

ADD COMMENT • link 13.4 years ago by Babak Memari • 0