A question of beguiling simplicity:
Given a long list of nucleotide GI numbers, how can one efficiently map them to their GenBank accession numbers?
I doubt NCBI EUtils will work for me, because the list is nearly a million long. My understanding is that you can't (or shouldn't) ping their server with so many requests.
I have used Ensembl's BioMart plenty in the past, but its organization into species-specific datasets precludes its use in my situation, because my GI numbers come from multiple taxa.
About the best solution I have been able to come up with is to download the entire blast database repo, and then, for each db, dump the accessions and gis with a command like:
blastdbcmd -db dbname -entry all -outfmt '%a %g'
Is there a better way?
Thanks.
You can ping servers with a large number of requests, provided that you respect the limits specified by the provider. The problem then becomes that the process takes days. So your solution is correct: once data goes beyond a certain size, it's better to work locally.
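Once you have the dumps, the lookup itself can be a single streaming pass. Here is a minimal sketch, assuming each dump line is "accession gi" separated by whitespace, as the '%a %g' format above would give. The function name map_gis and the sample values are made up for illustration. Holding only your GI list in memory as a set, rather than the full dump, should keep the footprint manageable.

```python
def map_gis(wanted_gis, dump_lines):
    """Return {gi: accession} for every wanted GI found in the dump.

    dump_lines is any iterable of 'accession gi' lines (assumed format),
    for example an open file handle on the blastdbcmd output.
    """
    wanted = set(wanted_gis)  # set gives O(1) membership checks
    found = {}
    for line in dump_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        accession, gi = parts
        if gi in wanted:
            found[gi] = accession
    return found


# Made-up sample data, not real GI/accession pairs.
sample_dump = ["ACC0001.1 111", "ACC0002.1 222", "", "malformed line here"]
result = map_gis(["222", "999"], sample_dump)
```

Any GI missing from the result (such as "999" here) was not in that database, so you can carry the leftovers over to the next dump.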