Map Genbank Gi To Accession Numbers (A Million Times)
12.3 years ago
Owen S. ▴ 370

A question of beguiling simplicity:

Given a long list of nucleotide gi numbers, how can one efficiently map them to their genbank accession numbers?

I doubt NCBI EUtils will work for me, because the list is nearly a million long. My understanding is that you can't (or shouldn't) ping their server with so many requests.

I have used Ensembl's BioMart plenty in the past, but its organization into species-specific datasets rules it out here, because my gi numbers come from many different taxa.

About the best solution I have been able to come up with is to download the entire blast database repo, and then, for each db, dump the accessions and gis with a command like:

blastdbcmd -db dbname -entry all -outfmt '%a %g'

Is there a better way?

Thanks.

genbank • 7.6k views

You can ping servers with a large number of requests, provided that you respect the limits specified by the provider. The problem then becomes that the process takes days. So your solution is correct: once data goes beyond a certain size, it's better to work locally.
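For anyone who does go the remote route, here is a minimal Python sketch of the "respect the limits" idea: send the ids in batches and pause between requests. `fetch_batch` is a hypothetical callable that you would write yourself (it takes a list of gi numbers and returns a dict of gi to accession), and the batch size and interval are placeholder values, not the provider's actual limits; check their published usage policy before picking numbers.

```python
import time


def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def map_in_batches(gis, fetch_batch, batch_size=200, min_interval=0.5,
                   sleep=time.sleep):
    """Call fetch_batch once per batch, pausing between calls.

    fetch_batch is a hypothetical user-supplied function: list of gi
    numbers in, {gi: accession} dict out. batch_size and min_interval
    are assumed placeholder values, not documented limits.
    """
    mapping = {}
    for index, batch in enumerate(batches(gis, batch_size)):
        if index > 0:
            sleep(min_interval)  # throttle between consecutive requests
        mapping.update(fetch_batch(batch))
    return mapping
```

Even with batching, a million ids means thousands of round trips, which is why working from a local dump tends to win at this scale.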

12.3 years ago
Will 4.6k

I've had a similar issue, and that was my solution as well. I dumped a text file of GI -> accession pairs, sorted it by GI, and converted it to a fixed-width format so that I could jump straight to any record and binary-search the file. It was the easiest and fastest method I could find.
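A rough Python sketch of the fixed-width idea, in case it helps someone. This is not my original script, just an illustration: the column widths, the function names, and the zero-padded layout are all assumptions made for the example.

```python
import os

GI_WIDTH = 12    # assumed width of the zero-padded gi column
ACC_WIDTH = 20   # assumed width of the space-padded accession column
RECORD_LEN = GI_WIDTH + ACC_WIDTH + 1  # +1 for the trailing newline


def write_fixed_width(pairs, path):
    """Write (gi, accession) pairs, sorted by gi, as fixed-width records."""
    with open(path, "wb") as out:
        for gi, acc in sorted(pairs):
            if len(acc) > ACC_WIDTH or len(str(gi)) > GI_WIDTH:
                raise ValueError(f"record too wide: {gi} {acc}")
            record = f"{gi:0{GI_WIDTH}d}{acc:<{ACC_WIDTH}}\n"
            out.write(record.encode("ascii"))


def lookup(handle, gi):
    """Binary-search an open fixed-width file; return accession or None."""
    handle.seek(0, os.SEEK_END)
    lo, hi = 0, handle.tell() // RECORD_LEN
    while lo < hi:
        mid = (lo + hi) // 2
        # Every record has the same length, so record `mid` starts here.
        handle.seek(mid * RECORD_LEN)
        record = handle.read(RECORD_LEN)
        found = int(record[:GI_WIDTH])
        if found == gi:
            acc = record[GI_WIDTH:GI_WIDTH + ACC_WIDTH]
            return acc.decode("ascii").strip()
        if found < gi:
            lo = mid + 1
        else:
            hi = mid
    return None
```

Open the file once in binary mode (`open(path, "rb")`) and call `lookup` for each gi. Each query costs about log2(N) seeks, so roughly 30 for a file with a billion records, and nothing has to be loaded into memory.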

11.2 years ago

You can also use:

blastdbcmd -db dbname -entry_batch long_list_of_nucleotide_gi_numbers.txt -outfmt '%a %g' -logfile entry_batch_stdout.log

This has the benefit of outputting only the entries you are interested in, ready for further downstream analyses. Any failed queries will be written to entry_batch_stdout.log (or whatever you fancy calling it).

I'm doing this with the '%a %T' output format at the moment to get a list of accession numbers and taxonomic IDs.
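If you capture the '%a %g' output in a file, a few lines of Python are enough to load it and to double-check which gi numbers came back empty. This is only a sketch: the function names are made up for the example, and it assumes one whitespace-separated "accession gi" pair per line, skipping any line whose gi field is not a number.

```python
def parse_mapping(lines):
    """Parse 'accession gi' lines into a {gi: accession} dict."""
    mapping = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # blank or malformed line
        accession, gi = fields
        if gi.isdigit():  # skip entries without a numeric gi
            mapping[int(gi)] = accession
    return mapping


def unmapped(requested_gis, mapping):
    """Return the requested gi numbers missing from the mapping, in order."""
    return [gi for gi in requested_gis if gi not in mapping]
```

For example, `parse_mapping(open("mapping.txt"))` (file name hypothetical) gives the dictionary, and `unmapped(my_gis, mapping)` lists the ids to cross-check against the log file.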

12.3 years ago

You might want to have a look at http://www.bridgedb.org. We created that to make your (local) life easier when it comes to identifier mapping.

More info here: http://dx.doi.org/10.1186/1471-2105-11-5

That will not immediately solve your multiple-species problem though, since the "standard" BridgeDB databases are single-species as well. You could just run over each of these, or create your own. We also have a HomoloGene cross-species mapping database, which can be stacked on any of the others, but if I understand your question correctly you will not need that.


Thanks, I had forgotten about BridgeDB. I used it in the past and found it useful. But as you point out, it is not quite the solution to this particular question, because of the multiple species.
