Question

Map Genbank Gi To Accession Numbers (A Million Times)

3

12.3 years ago

Owen S. ▴ 370

A question of beguiling simplicity:

Given a long list of nucleotide GI numbers, how can one efficiently map them to their GenBank accession numbers?

I doubt NCBI EUtils will work for me, because the list is nearly a million long. My understanding is that you can't (or shouldn't) ping their server with so many requests.

I have used Ensembl's BioMart plenty in the past, but its organization into species-specific datasets precludes its use in my situation, because my GI numbers come from multiple taxa.

About the best solution I have been able to come up with is to download the entire BLAST database repository and then, for each db, dump the accessions and GIs with a command like:

blastdbcmd -db dbname -entry all -outfmt '%a %g'

Is there a better way?

Thanks.

genbank • 7.6k views

ADD COMMENT • link updated 11.2 years ago by Steve Moss 2.3k • written 12.3 years ago by Owen S. ▴ 370

1

You can ping servers with a large number of requests, provided that you respect the limits specified by the provider. The problem then becomes that the process takes days. So your solution is correct: once data goes beyond a certain size, it's better to work locally.

ADD REPLY • link 12.3 years ago by Neilfws 49k
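
To put a rough number on "takes days": the sketch below assumes one GI per request and a rate of 3 requests per second, which is the limit NCBI documents for E-utilities without an API key. Both figures are assumptions for illustration and neither comes from the thread; the function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def estimated_days(n_requests, requests_per_second):
    """Return the wall-clock days needed to send n_requests at a fixed rate."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    return n_requests / requests_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    # Assumed: one million GIs, one request each, 3 requests per second.
    print(f"{estimated_days(1_000_000, 3):.1f} days")
```

Under those assumptions a million single-GI requests need a little under four days of continuous querying, before counting retries or network latency.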

score 3 · Answer 1 · 2012-08-01

I've had a similar issue, and that was my solution as well. I just dumped out a text file of GI -> accession mappings and then searched through that. After sorting the file and converting it to a fixed-width format (so I could skip around with a binary search), it was the easiest and fastest method I could find.
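
A minimal sketch of the sorted, fixed-width lookup described above might look like this. The column widths, function names, and record layout are assumptions chosen for illustration, not the format the answer actually used.

```python
import os

GI_WIDTH = 12    # assumed: GI stored zero-padded to 12 digits
ACC_WIDTH = 20   # assumed: accession left-justified, space-padded to 20 chars
RECORD_LEN = GI_WIDTH + ACC_WIDTH + 1  # +1 for the trailing newline


def write_fixed_width(pairs, handle):
    """Write (gi, accession) pairs, sorted by GI, as fixed-width records."""
    for gi, acc in sorted(pairs):
        record = f"{gi:0{GI_WIDTH}d}{acc:<{ACC_WIDTH}}\n".encode("ascii")
        if len(record) != RECORD_LEN:
            raise ValueError(f"entry does not fit the record layout: {gi} {acc}")
        handle.write(record)


def lookup(handle, gi):
    """Binary-search a seekable binary handle of records; None if absent."""
    handle.seek(0, os.SEEK_END)
    lo, hi = 0, handle.tell() // RECORD_LEN
    while lo < hi:
        mid = (lo + hi) // 2
        # Every record has the same length, so record i starts at i * RECORD_LEN.
        handle.seek(mid * RECORD_LEN)
        record = handle.read(RECORD_LEN).decode("ascii")
        current = int(record[:GI_WIDTH])
        if current == gi:
            return record[GI_WIDTH:].strip()
        if current < gi:
            lo = mid + 1
        else:
            hi = mid
    return None
```

Because every record has the same length, each probe is a single seek and read, so a lookup touches only about log2(N) records and the file never has to be loaded into memory. The handle must be opened in binary mode (for example `open(path, "rb")`).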

score 2 · Answer 2 · 2013-10-01

You can also use:

blastdbcmd -db dbname -entry_batch long_list_of_nucleotide_gi_numbers.txt -outfmt '%a %g' -logfile entry_batch_stdout.log

This has the benefit of only outputting the entries you are interested in, ready for further downstream analyses. Any failed queries will be kicked out to entry_batch_stdout.log (or whatever you fancy calling it).

I'm doing this with the '%a %T' output format at the moment to get a list of accession numbers and taxonomic IDs.
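
If the result is needed inside a script, the command above could be wrapped roughly as follows. This is a sketch under assumptions: `blastdbcmd` is on the PATH, the output is one "accession GI" pair per line as requested by '%a %g', and the function names are hypothetical. The exit status of `blastdbcmd` when some entries fail is not checked here, since its exact behaviour is not something this sketch relies on.

```python
import subprocess


def parse_mapping(text):
    """Parse blastdbcmd '%a %g' output into a {gi: accession} dict."""
    mapping = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines and anything that is not "accession <numeric GI>".
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        accession, gi = parts
        mapping[int(gi)] = accession
    return mapping


def map_gis(db, gi_file, logfile="entry_batch_stdout.log"):
    """Run blastdbcmd on a file of GIs and return the parsed mapping."""
    result = subprocess.run(
        ["blastdbcmd", "-db", db, "-entry_batch", gi_file,
         "-outfmt", "%a %g", "-logfile", logfile],
        capture_output=True, text=True, check=False,
    )
    return parse_mapping(result.stdout)
```

GIs from the input list that are missing from the returned dict are the ones to look for in the log file.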

score 0 · Answer 3 · 2012-08-02

0

12.3 years ago

Chris Evelo 10k

You might want to have a look at http://www.bridgedb.org. We created that to make your (local) life easier when it comes to identifier mapping.

More info here: http://dx.doi.org/10.1186/1471-2105-11-5

That will not immediately solve your multiple-species problem though, since the "standard" BridgeDb databases are single-species as well. You could just run over each of these, or create your own. We also have a HomoloGene cross-species mapping database, which could be stacked on any of the others, but if I understand your question correctly you will not need that.

ADD COMMENT • link 12.3 years ago by Chris Evelo 10k

0

Thanks, I had forgotten about bridgedb. I used it in the past and found it useful. But as you point out, it is not exactly the solution to this particular question, because of the multiple species.

ADD REPLY • link 12.2 years ago by Owen S. ▴ 370