How Do I Retrieve The Protein Gi Numbers Given The Taxids?
4
2
Entering edit mode
12.8 years ago
Tawny ▴ 180

I have a list of taxon ids and I need to get a list of all of the protein gi numbers for those taxa. The taxon ID list contains over 1700 entries. I have tried to use the soultion provided here: http://biostar.stackexchange.com/questions/17761/refseq-proteins-for-a-given-taxid but this uses Python which is not available on the hpc I use. I can use Perl/BioPerl and have tried to understand the EUtilities package but I am not sure how exactly to get the data I need.
Has anyone done this before?

protein • 6.5k views
ADD COMMENT
5
Entering edit mode
12.8 years ago

You can download and uncompress the gitaxiprot.dmp.gz (or zip file) from ftp://ftp.ncbi.nih.gov/pub/taxonomy/

Then you can obtain all the GIs for a given taxon (say 9606) with a simple:

$ gawk '$2==9606{print $1}' gi_taxid_prot.dmp

or if you prefer to use Perl:

$ perl -lane 'print $F[0] if ($F[1] == 9606)' gi_taxid_prot.dmp

You can even create a bash script:

$ cat > taxid2gi.sh
#!/bin/sh
gawk -v taxid="$1" '$2==taxid{print $1}' $2
^D

Make it executable

$ chmod +x taxid2gi.sh

And run it with any taxid:

./taxid2gi.sh 9606 gi_taxid_prot.dmp > 9606_gis.txt
./taxid2gi.sh 10090 gi_taxid_prot.dmp > 10090_gis.txt

If you have a lot of taxids to check, create a simple Perl script like:

use strict;
use warnings;

# First argument a file with your taxids (1 per line)
my ($taxids, $gitaxid_file) = @ARGV;

open my $taxids_fh, "<", $taxids or die $!;
my %taxids = map {chomp; $_ => 1} (<$taxids_fh>);
close($taxids_fh);

open my $gitaxid_fh, "<", $gitaxid_file or die $!;
while (<$gitaxid_fh>) {
    chomp;
    my ($gi, $taxid) = split;
    if (defined $taxids{$taxid}) {
    print "$_\n";
    }
}
close($gitaxid_fh);

M;

ADD COMMENT
0
Entering edit mode

Miguel, your perl script above worked great for me. I was able to iterate through my large list of taxon IDs and now have the GI list that I need. Thank you for simplifying what I was making much harder than it needed to be!

ADD REPLY
0
Entering edit mode

Great to hear that

ADD REPLY
2
Entering edit mode
12.8 years ago
Bill Pearson ★ 1.0k

An easy way is to go to entrez proteins and use:

txid9606 AND srcdb_refseq[properties]

This gets you all human proteins from RefSeq. If possible, you need srcdb_refseq[properties] to get a sensible set; without that, you would get almost 600,000 entries.

Once you have done this, you can download to a file and select gi numbers.

Alternatively, for eutils, try some code that looks like this (perl), which is designed to get gi numbers using a file of the form:

taxon_id t organism_name [?][?] my $db = 'protein'; my $base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';

while (my $tx_line = <> ) { chomp ($tx_line); next unless ($tx_line);

my ($taxon, $descr) = split("\t",$tx_line);
my ($sname) = ($descr =~ m/^(\w+)/);
$sname = lc($sname);

my $out_file = $taxon . "_" . $sname . ".gi";

open(FOUT,">$out_file") || die "cannot open $out_file";

my $query = "srcdb_refseq[prop]+AND+$taxon"."[orgn]";

my $url = $base . "esearch.fcgi?db=$db&term=$query&usehistory=y";

#post the esearch URL                                                                                                         
my $esearch_result = get($url);

my ($count, $querykey, $webenv) = ($esearch_result =~
                                   m|<Count>(\d+)</Count>.*<QueryKey>(\d+)</QueryKey>.*<WebEnv>(\S+)</WebEnv>|s);

if ($count < 1) { return "";}

my $retmax=1000;

my $efetch = "";

for(my $retstart = 0; $retstart < $count; $retstart += $retmax) {
    $url = $base . "esearch.fcgi?"
        . "retstart=$retstart&retmax=$retmax&"
        . "db=$db&query_key=$querykey&WebEnv=$webenv";
    $efetch = get($url);

# now extract the gi numbers

    my @new_gis = ($efetch =~ m/<Id>(\d+)<\/Id>/g);

    print FOUT join("\n",@new_gis) . "\n";
}
close FOUT;

} [?][?]

ADD COMMENT
0
Entering edit mode

Is this really feasible with 1700 taxon ids?

ADD REPLY
0
Entering edit mode

Yes, though you may need to check first to see if there are actually refseq sequences for 1700 taxon id's (I suspect not). We routinely o this for hundreds of bacterial taxa.

Alternatively, you might consider simply downloading the taxon id database files from NCBI and either loading them into MYSQL or using a perl/awk script to get the gi's you want. This will give you all GI's, which will mean you have lots of duplicate sequences.

ADD REPLY
1
Entering edit mode
12.4 years ago
Fallino ▴ 20

GNU GREP is faster than GAWK :

grep [[:space:]]9606$ gi_taxid_prot.dmp > gis_from_taxid9606.txt
grep ^12345[[:space:]] gi_taxid_prot.dmp > taxid_from_gi12345.txt
ADD COMMENT
0
Entering edit mode
12.8 years ago
Michael 55k

I have simply translated the python script you found in the question you cite, and left out the last part where the sequence is obtained because you want the gis only. You have to install Bio::DB::SoapEUtilities. I am not 100% sure about the best query string, the original one seems to work:

#!/usr/bin/env perl 
use strict;
use warnings; 
use Bio::DB::SoapEUtilities;
# factory construction
my $fac = Bio::DB::SoapEUtilities->new();
# get a Bio::DB::SoapEUtilities::Result object
my $entrezDbName = 'protein';
my $ncbiTaxId = '1001533'; # Bovine papillomavirus 7
# Find entries matching the query
my $entrezQuery = "refseq[filter] AND txid$ncbiTaxId";
my $result = $fac->esearch(
               -email => 'bla\@blub.org',
               -db => 'protein',
               -term => $entrezQuery)->run;
print join "\n", @{$result->ids()}, "\n";
ADD COMMENT
0
Entering edit mode

Michael, thank you for this SoapEUtilities solution. I hadn't found this package before. I am having trouble getting Perl packages loaded on our organization's hpc as I don't have 'permission' to add packages but one that hurdle is crossed I will try this solution.

ADD REPLY

Login before adding your answer.

Traffic: 1475 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6