An easy way is to go to the Entrez Protein database and use the following query:
txid9606 AND srcdb_refseq[properties]
This gets you all human proteins from RefSeq. Note that you need srcdb_refseq[properties] to get a sensible set; without it, you would get almost 600,000 entries.
Once you have the results, you can download them to a file, choosing the GI list as the output format.
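If you would rather do the same thing from the command line, here is a minimal ESearch sketch in Perl (assuming LWP::Simple is installed; retmax=100 is just for illustration, the real result set is much larger):

```perl
use strict;
use warnings;
use LWP::Simple;

# ESearch with the same query term; prints up to 100 matching GI numbers
my $base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
my $term = 'txid9606+AND+srcdb_refseq[properties]';
my $xml  = get($base . "esearch.fcgi?db=protein&term=$term&retmax=100");
die "ESearch request failed" unless defined $xml;
print "$_\n" for ($xml =~ m/<Id>(\d+)<\/Id>/g);
```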
Alternatively, to handle many organisms at once with E-utilities, try a Perl script along these lines, which fetches GI numbers for each entry in a tab-separated file of the form:
taxon_id<TAB>organism_name
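For example, an input file covering two organisms (tab-separated) could look like:

```
9606	Homo sapiens
10090	Mus musculus
```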
```perl
use strict;
use warnings;
use LWP::Simple;

my $db   = 'protein';
my $base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';

# read taxon_id<TAB>organism_name lines from the input file(s)
while (my $tx_line = <>) {
    chomp($tx_line);
    next unless ($tx_line);
    my ($taxon, $descr) = split("\t", $tx_line);
    my ($sname) = ($descr =~ m/^(\w+)/);
    $sname = lc($sname);

    # one output file per taxon, e.g. 9606_homo.gi
    my $out_file = $taxon . "_" . $sname . ".gi";
    open(FOUT, ">$out_file") || die "cannot open $out_file";

    # ESearch for RefSeq proteins of this taxon, keeping the result on the history server
    my $query = "srcdb_refseq[prop]+AND+$taxon" . "[orgn]";
    my $url   = $base . "esearch.fcgi?db=$db&term=$query&usehistory=y";

    # post the esearch URL and pull out the count, query key and WebEnv
    my $esearch_result = get($url);
    my ($count, $querykey, $webenv) = ($esearch_result =~
        m|<Count>(\d+)</Count>.*<QueryKey>(\d+)</QueryKey>.*<WebEnv>(\S+)</WebEnv>|s);
    if ($count < 1) {
        close FOUT;
        next;    # nothing found for this taxon, move on
    }

    # page through the stored result set, $retmax IDs at a time
    my $retmax = 1000;
    for (my $retstart = 0; $retstart < $count; $retstart += $retmax) {
        $url = $base . "esearch.fcgi?"
             . "retstart=$retstart&retmax=$retmax&"
             . "db=$db&query_key=$querykey&WebEnv=$webenv";
        my $result = get($url);

        # now extract the gi numbers from the returned <IdList>
        my @new_gis = ($result =~ m/<Id>(\d+)<\/Id>/g);
        print FOUT join("\n", @new_gis) . "\n";
    }
    close FOUT;
}
```
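Assuming you save the script as, say, get_refseq_gis.pl and the taxon list as taxa.txt (both names are arbitrary), you would run it with:

```
perl get_refseq_gis.pl taxa.txt
```

It writes one file per input line, named <taxon_id>_<genus>.gi, e.g. 9606_homo.gi for the human entry above.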
Miguel, your Perl script above worked great for me. I was able to iterate through my large list of taxon IDs and now have the GI list that I need. Thank you for simplifying what I was making much harder than it needed to be!
Great to hear that!