Question

Bioperl Get Gene Id From Protein Accession Using Eutilities

4

Entering edit mode

11.8 years ago

rwn ▴ 610

Hello,

I would like to get a list of NCBI-type gene IDs from a list of protein accession numbers. I am attempting to use the BioPerl EUtils for this job (there might be other ways, perhaps better?), using eLink to search across DB's. In other words, I would like to go from something like YP_273711.1 to 3557579, from which I can then pass to eSummary and eFetch to get the actual DNA sequence of that protein. This is my simple code (with a test accession in @ids):

#!/usr/bin/perl

use strict;
use warnings;

use Bio::DB::EUtilities;

my @ids     = qw(YP_273711.1);

my $factory = Bio::DB::EUtilities->new(-eutil          => 'elink',
                                       -email          => 'foo@bar',
                                       -db             => 'gene',
                                       -dbfrom         => 'protein',
                                       -correspondence => 1,
                                       -id             => \@ids);

# iterate through the LinkSet objects
while (my $ds = $factory->next_LinkSet) {
    print "  Link name: ",$ds->get_link_name,"\n";
    print "Protein IDs: ",join(',',$ds->get_submitted_ids),"\n";
    print "   Gene IDs: ",join(',',$ds->get_ids),"\n";
}

Now why on earth is this not working?? When run, it returns absolutely nothing, suggesting that there are no LinkSet objects generated in the first place. I would have thought that this kind of think is precisely what eLink is supposed to used for... I know the two pieces of info are cross-linked: simply querying YP_273711.1 in NCBI's search-engine brings up the gene ID under the 'Gene: gene-centered information' tab of the search platform.

Does anyone have any suggestions, either as to why the above doesn't return what I need, or better ways of doing this? Am I being an idiot and doing something daft?

Or is it necessary to first go from accession to GI? ie, from YP_273711.1 -> 71735245 -> 3557579 ?? Thus, is this possible using eLink?

It's very frustrating, I thought I'd nailed it...

Many thanks for your time, Reuben

PS, does anyone know what the precise difference between a GI and a gene ID is? Sorry for the ton of questions...

bioperl ncbi • 7.6k views

ADD COMMENT • link 10.0 years ago by rwn ▴ 610

0

Entering edit mode

Tip: EUtils requires only something that looks like an email address, as opposed to your real one :)

ADD REPLY • link 11.8 years ago by Neilfws 49k

score 3 · Answer 1 · 2013-02-27

Part of the confusion may stem from the fact that you are trying to map gene IDs and accessions that are not related. Also, you are not starting with an ID. That's okay, it is easy to get the ID from the term with esearch, then use the same code you have in place to get the gene ID through elink.

#!/usr/bin/env perl                                      

use strict;
use warnings;
use Bio::DB::EUtilities;

my @ids = qw(YP_273711.1);

my $factory = Bio::DB::EUtilities->new(-eutil   => 'esearch',
                                       -db      => 'protein',
                                       -email   => 'someemail@somedomain',
                                       -correspondence => 1,
                                       -term => \@ids,
                                       );

my @uids = $factory->get_ids;

$factory->reset_parameters(-eutil => 'elink',
                           -db    => 'gene',
                           -dbfrom => 'protein',
                           -id     => \@uids,
                           );

while (my $ds = $factory->next_LinkSet) {
    print "   Link name: ",$ds->get_link_name,"\n";
    print "Protein IDs: ",join(',',$ds->get_submitted_ids),"\n";
    print "    Gene IDs: ",join(',',$ds->get_ids),"\n";
}

Running this gives....

Link name: protein_gene
Protein IDs: 71736864
Gene IDs: 3559251

Just one more step gets you the sequence you want, but I'll leave that for you to implement. I don't want to take all the fun out of this for you ;).

score 2 · Answer 2 · 2013-02-28

Thanks very much for your help!

I eventually worked it out, using eSearch followed by eLink followed by eFetch to get the corresponding CDS (as both posters suggested above).

My think my confusion stemmed primarily from not really understanding what exactly GI's, gene ID's, accessions, UID's etc etc were, in terms of what was appropriate to use as an initial query. For example, I couldn't understand why (in my initial question) using an accession as a query and setting '-dbfrom => protein' and 'db => gene' didn't get me what I needed. But I now understand that both the protein seq and its corresponding CDS have unique GI numbers, even though they refer to the same biological entity (you know what I mean!).

Anyway, I cobbled together the following from scraps of the EUtilites Cookbook and it seems to do what I want, in case it's ever of use for someone else. It takes a multi fasta file of proteins with accessions (ie >YP_273711.1 as fasta header) and returns a multi fasta of corresponding CDS's, where the fasta header is made up of 4 bits of info:

locusname | proteinaccession | protienGI# | geneid#

i.e. hopefully everything I'll ever need!

Here it is:

#!/usr/bin/perl

use strict;
use warnings;

use Bio::Seq;
use Bio::SeqIO;
use Bio::DB::EUtilities;

use List::Compare;

## Takes a list of protein seqs, with accessions, and uses NCBI's Eutils to get
## the corresponding CDS nucleotide sequence.

print "\n\t#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#\n\t# eutils_DNAFromProteinAccession.pl #\n\t#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#\n\n";

## open IN and OUT
my $seqio_in = Bio::SeqIO->new( -file => $ARGV[0], -format => 'fasta' );
my $outfile = $ARGV[0];
$outfile =~ s/.fasta/_DNA.fasta/;
my $seqio_out = Bio::SeqIO->new( -file => ">".$outfile, -format => 'fasta' );

## get list of accessions 
my @accessions;
while ( my $seq = $seqio_in->next_seq() ) {
    if ( $seq->display_id() =~ m/\|/) {
        my $header = $seq->display_id();
        my @a = split (/\|/, $header);
        push ( @accessions, $a[1] );
    } else {
        push ( @accessions, $seq->display_id() );
    }
}

print "\tThere are ".@accessions." sequences in input file: '".$ARGV[0]."'\n\n";

print "\tGetting GI's from accessions... ";

## use eSearch to convert accs into GI's. the trick here is to pass eSearch -term  
## a COMMA-SEPERATED list using join:
my $factory = Bio::DB::EUtilities->new(-eutil  => 'esearch',
                                       -email  => 'email@email.com',
                                       -db     => 'protein',
                                       -term   => join(',',@accessions),
                                       -retmax => 500 );

## get GI's
my @gis = $factory->get_ids;
print scalar(@gis)." found\n";

## use eLink to get gene IDs from GI's
$factory->reset_parameters(-eutil  => 'elink',
                           -email  => 'email@email.com',
                           -db     => 'gene',
                           -dbfrom => 'protein',
                           -id     => \@gis,
                           -retmax => 500 );

my @gene_ids;

print "\tGetting Gene ID's from GI's...\n";

## iterate through the LinkSet objects
while (my $ds = $factory->next_LinkSet) {
    # print "\t       Link name: ",$ds->get_link_name,"\n";
    # print "\tProtein IDs (GI): ",join(', ',$ds->get_submitted_ids),"\n";
    # print "\t        Gene IDs: ",join(', ',$ds->get_ids),"\n";

    ## get gene ID's
    push (@gene_ids, $ds->get_ids);
}
print "\n\t".scalar(@gene_ids)." gene ID's found\n\n\tMapping details:\n\tacc <-> gi <-> gene_id\n";

for my $i ( 0 .. $#accessions ) {
    print "\t".$accessions[$i]." <-> ".$gis[$i]." <-> ".$gene_ids[$i]."\n"; 
}

my $eutil   = Bio::DB::EUtilities->new(-eutil  => 'esummary',
                                       -email  => 'email@email.com',
                                       -db     => 'gene',
                                       -id     => \@gene_ids,
                                       -retmax => 500 );

my $fetcher = Bio::DB::EUtilities->new(-eutil   => 'efetch',
                                       -email   => 'email@email.com',
                                       -db      => 'nucleotide',
                                       -rettype => 'gb',
                                       -retmax  => 500 );

print "\n\tGetting sequence data...\n\n";
my @check;

while (my $docsum = $eutil->next_DocSum) {

  ## to ensure we grab the right ChrStart information, we grab the Item above
  ## it in the Item hierarchy (visible via print_all from the eutil instance)
  my ($item) = $docsum->get_Items_by_name('GenomicInfoType');

  my %item_data = map {$_ => 0} qw(ChrAccVer ChrStart ChrStop);

  while (my $sub_item = $item->next_subItem) {
    if (exists $item_data{$sub_item->get_name}) {
            $item_data{$sub_item->get_name} = $sub_item->get_content;
    }
  }

  ## check to make sure everything is set
  for my $check (qw(ChrAccVer ChrStart ChrStop)) {
    die "\t$check not set" unless $item_data{$check};
  }

  my $strand = $item_data{ChrStart} > $item_data{ChrStop} ? 2 : 1;
  printf("\tRetrieving %s, from %d-%d, strand %d\n", $item_data{ChrAccVer},
                                                     $item_data{ChrStart},
                                                     $item_data{ChrStop},
                                                     $strand
                                                     );

    my $length = $item_data{ChrStart} - $item_data{ChrStop};
    if ( $length < 0 ) {
        $length = $length * -1;
    }
    print "\tLength of seq: ".$length." bp\n";

  $fetcher->set_parameters(-id        => $item_data{ChrAccVer},
                           -seq_start => $item_data{ChrStart} + 1,
                           -seq_stop  => $item_data{ChrStop} + 1,
                           -strand    => $strand);

  ## dump into a temp genbank file 
  my $tmp_file = 'tmp.gb';
  $fetcher->get_Response( -file => $tmp_file );

  my $seqin = Bio::SeqIO->new(-file => $tmp_file, -format => 'genbank');

  my ($organism, $gene_product);

  ## get some cross-link information for each seq, and incorporate it into display_id
    while ( my $seq = $seqin->next_seq ) {

        my %display_id;
         for my $feat_object ($seq->get_SeqFeatures) {

            ## get length of current SeqFeature
            my $length2 = ($feat_object->end) - ($feat_object->start);

            ## IMPORTANT ##
            ## because of overlapping CDS in bacterial genomes,
            ## lots of GenBank files have >1 CDS in them, and more than one protein 
            ## product etc. By matching up the length of the retrieved seq with the 
            ## expected length found within the gb file, we ensure that the correct 
            ## information is retrieved and passed to $name

            if ($length2 == $length) {

                for my $tag ($feat_object->get_all_tags) {  
                    for my $value ($feat_object->get_tag_values($tag)) {

                        ## get locus_tag, protein_id, $organism and $gene_product
                        if ($tag =~ m/^locus_tag/) {
                            $display_id{$value}=();
                        } elsif ($tag =~ m/^protein_id/) {
                            $display_id{$value}=();
                        } elsif ($tag =~ m/organism/) {
                            $organism = $value;
                        } elsif ($tag =~ m/product/) {
                            $gene_product = $value;
                        }

                        ## get both the GI and GeneID
                        if (($value =~ m/^GI:/) or ($value =~ m/^GeneID:/)) {
                            $display_id{$value}=();
                        }
                    }          
                }  
            }     
        }

        $gene_product =~ s/\s+/_/g;
        $organism =~ s/\s+/_/g;
        $organism =~ s/\.//g;

        ## make the seq name out of info gathered from above        
        my $name = join ('|', sort {$b cmp $a} keys %display_id)."|".$gene_product."|".$organism;

        foreach (sort {$b cmp $a} keys %display_id) {
            print "\t\t".$_."\n";
        }

        ## make new seqobj and print it to file
        print "\tPrinting DNA seq for ".$name."\n\t~~~\n";
        my $seqobj = Bio::Seq->new( -display_id => $name, -seq => $seq->seq() );
        $seqio_out->write_seq($seqobj);

        ## check if input == output
        my @a = keys %display_id;
        push @check, $a[1];

    }
    ## remove tmp file
    unlink ($tmp_file);
}

## quick check if # inputs == # outputs
if ( scalar(@accessions) != scalar(@check) ) {
    print "\n\tWARNING: number of output seqs != number of input seqs (check -retmax flag perhaps?)\n\tThere were ".(@check)." seqs written to file\n\tThese ones not in output:\n";
my $lc = List::Compare->new(\@accessions, \@check);
    my @missing = $lc->get_unique();
    foreach (@missing) { print "\t".$_."\n" };
} else {
    print "\n\tThere were ".(@check)." seqs written to file - A-OK!\n";
}
print "\n\tFinished doing it, so it is\n\n";

Any comments or constructive criticisms are most welcome, as they will no doubt make my Perling better (it's pretty basic as you can see). It seems to works for me, clearly it might not work in all cases.

Thanks again for your time,

Reuben

score 1 · Answer 3 · 2013-02-27

The problem here is that you are using the protein accession for your Elink query, when you should be using the protein ID. This is easily demonstrated by changing line 5 in your code to read:

my @ids     = qw(71736864);

So to answer your later question: yes, it is necessary to go from accession to protein ID first, then from protein ID to gene ID.

You get from accession to protein ID using Esearch. See the EUtilities cookbook for details and example code.

I agree that the world of accessions, IDs etc. is very confusing. Basically, NCBI Entrez uses numeric identifiers for all of its databases (protein, gene, PubMed and so on). They are often given slightly different names such as UID, GI, PMID. Basically, though, they are all "IDs". So a gene ID is simply the ID used in the Gene database. A GI might be the term used for an ID in the nucleotide database.

Finally I note that when testing the code, I get different IDs to those in your example:

  Link name: protein_gene
Protein IDs: 71736864
   Gene IDs: 3559251