Help mapping old GenBank GI to gene symbols using R
10.0 years ago

Hello, I've been given output from a custom human microarray created in 2002 which includes columns for unigene, locus, and gi. I've been using both the org.Hs.eg.db and BioMart libraries in an attempt to map ANY of these to a standard gene symbol. Out of the 8,100 probes, only 394 of the unigenes were mapped, as most of them have been deprecated. Only some 400 of the locus IDs worked when I specified refseq_mrna as a filter in the Ensembl BioMart. These all began with "NM_", but many others in the locus column start with "AV", "AK", "AA", etc. As I understand it, the "GI" field is an old GenBank gene identifier number, but I can't for the life of me find any programmatic way to query it. What to do, Bioinformatics gurus? Here's some data:

unigene       locus        gi
Hs.339868     NM_003974    4503358
Hs.108854     AK024569     10436879
Hs.240457     NM_004584    4759021
Hs.179735     NM_005167    4885066
Hs.76728      AV724531     10829010
Hs.288061     AK025375     10437878
Hs.125307     AA836204     2910523
Hs.288061     BC002409     12803202
Hs.251653     AK026594     10439481
Hs.74621      U29185       2865216
NA            BE899595     10367264
Hs.37617      AL532303     12795796
Hs.169824     NM_002258    4504878
Hs.89887      D38081       533325

As I understand the "GI" field is an old GenBank gene identifier number

No, it's a primary key in NCBI GenBank.


Which R package and filter can I use to query it?

10.0 years ago
Siva ★ 1.9k

You can see the Revision History for a sequence in Entrez by selecting Revision History in Display Settings (top left corner in the page).

If you want to access this programmatically, there seems to be no direct way, such as through E-utilities. A workaround would be to construct a URL for each GI and parse the HTML output, which you can do in BioPerl:

http://doc.bioperl.org/bioperl-live/Bio/DB/SeqVersion/gi.html

As always with programmatic access, please follow the NCBI Usage Guidelines and Requirements

Edit: I think I misunderstood the question. I thought the OP was asking about 'old GIs' that they could not find in the current database.

10.0 years ago
Chris S. ▴ 340

You can use E-utilities, for example through the Entrez Direct tools below. Search Entrez Gene for each accession in your table, fetch the XML or other results, and optionally parse out what you need with xtract once you are familiar with the tags. This will miss the 4 ESTs like AV724531 in your list, but what symbol do you want for those?

esearch -db gene -query NM_003974 | esummary | xtract -pattern DocumentSummary -element Id Name OtherAliases Description
9046    DOK2    p56DOK, p56dok-2        docking protein 2, 56kDa
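
To apply the same lookup to every accession in the locus column, a loop along these lines might work. It is only a sketch: the sample rows stand in for the full table, the RUN_EDIRECT switch is made up for illustration, and the esearch/esummary/xtract call is the one shown above. Without RUN_EDIRECT=1 it just lists the accessions it would query.

```shell
# Hypothetical input: a few rows of the table from the question,
# written to a temporary file as whitespace-separated text.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
unigene       locus        gi
Hs.339868     NM_003974    4503358
Hs.76728      AV724531     10829010
Hs.240457     NM_004584    4759021
EOF

# Column 2 holds the accession; NR > 1 skips the header row.
accessions=$(awk 'NR > 1 { print $2 }' "$tmp")
echo "$accessions"

# RUN_EDIRECT is an illustrative switch, not an EDirect feature.
if [ "${RUN_EDIRECT:-0}" = "1" ]; then
  for acc in $accessions; do
    # Prefix each result row with the accession that produced it.
    # Accessions with no Gene record (e.g. ESTs) simply print nothing.
    esearch -db gene -query "$acc" | esummary |
      xtract -pattern DocumentSummary -element Id Name |
      awk -v acc="$acc" '{ print acc "\t" $0 }'
    sleep 1   # pause between requests, per NCBI usage guidelines
  done
fi
rm -f "$tmp"
```

Keeping the accession in the first output column makes it possible to join the symbols back onto the probe table afterwards.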

I'm curious why we can't just somehow use the GI, which, although old, is stable. It's not in BioMart, is it? I thought I looked all over for it. And why isn't there a BioMart for ESTs? Hmm.


No, I doubt it is in BioMart, since these are NCBI nucleotide GIs and there are almost a billion of them. The GIs are unique, but they get replaced all the time with new GIs (check the status column in the output of the first command below). So I think you can just search for these GIs in Entrez Nucleotide and any linked databases.

esummary -db nucleotide \
  -id 4503358,10436879,4759021,4885066,10829010,10437878,2910523,12803202,10439481,2865216,10367264,12795796,4504878,533325 | \
  xtract -pattern DocumentSummary \
    -element Id Caption Extra Status Title

# and only 4 are linked to Gene
elink -db nucleotide \
  -target gene \
  -id 4503358,10436879,4759021,4885066,10829010,10437878,2910523,12803202,10439481,2865216,10367264,12795796,4504878,533325 | \
  esummary | \
  xtract -pattern DocumentSummary \
    -element Id Name OtherAliases Description
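
For the full set of 8,100 probes, the -id list does not have to be typed by hand. A minimal sketch, assuming the table is available as whitespace-separated text with the GI in the third column (the sample rows are taken from the question):

```shell
# Hypothetical input: a few rows of the table, held in a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
unigene       locus        gi
Hs.339868     NM_003974    4503358
Hs.108854     AK024569     10436879
NA            BE899595     10367264
EOF

# Keep only numeric values from column 3 (this also drops the header),
# then join them with commas.
ids=$(awk '$3 ~ /^[0-9]+$/ { print $3 }' "$tmp" | paste -sd, -)
echo "$ids"
rm -f "$tmp"
```

The resulting string can then be passed as -id "$ids" to the esummary and elink commands above. For thousands of GIs it is probably safer to split the list into smaller batches, though I have not checked what limits apply.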