Thanks very much for your help!
I eventually worked it out, using eSearch followed by eLink followed by eFetch to get the corresponding CDS (as both posters suggested above).
My think my confusion stemmed primarily from not really understanding what exactly GI's, gene ID's, accessions, UID's etc etc were, in terms of what was appropriate to use as an initial query. For example, I couldn't understand why (in my initial question) using an accession as a query and setting '-dbfrom => protein' and 'db => gene' didn't get me what I needed. But I now understand that both the protein seq and its corresponding CDS have unique GI numbers, even though they refer to the same biological entity (you know what I mean!).
Anyway, I cobbled together the following from scraps of the EUtilites Cookbook and it seems to do what I want, in case it's ever of use for someone else. It takes a multi fasta file of proteins with accessions (ie >YP_273711.1 as fasta header) and returns a multi fasta of corresponding CDS's, where the fasta header is made up of 4 bits of info:
locusname | proteinaccession | protienGI# | geneid#
i.e. hopefully everything I'll ever need!
Here it is:
#!/usr/bin/perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO;
use Bio::DB::EUtilities;
use List::Compare;
## Takes a list of protein seqs, with accessions, and uses NCBI's Eutils to get
## the corresponding CDS nucleotide sequence.
print "\n\t#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#\n\t# eutils_DNAFromProteinAccession.pl #\n\t#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#\n\n";
## open IN and OUT
my $seqio_in = Bio::SeqIO->new( -file => $ARGV[0], -format => 'fasta' );
my $outfile = $ARGV[0];
$outfile =~ s/.fasta/_DNA.fasta/;
my $seqio_out = Bio::SeqIO->new( -file => ">".$outfile, -format => 'fasta' );
## get list of accessions
my @accessions;
while ( my $seq = $seqio_in->next_seq() ) {
if ( $seq->display_id() =~ m/\|/) {
my $header = $seq->display_id();
my @a = split (/\|/, $header);
push ( @accessions, $a[1] );
} else {
push ( @accessions, $seq->display_id() );
}
}
print "\tThere are ".@accessions." sequences in input file: '".$ARGV[0]."'\n\n";
print "\tGetting GI's from accessions... ";
## use eSearch to convert accs into GI's. the trick here is to pass eSearch -term
## a COMMA-SEPERATED list using join:
my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch',
-email => 'email@email.com',
-db => 'protein',
-term => join(',',@accessions),
-retmax => 500 );
## get GI's
my @gis = $factory->get_ids;
print scalar(@gis)." found\n";
## use eLink to get gene IDs from GI's
$factory->reset_parameters(-eutil => 'elink',
-email => 'email@email.com',
-db => 'gene',
-dbfrom => 'protein',
-id => \@gis,
-retmax => 500 );
my @gene_ids;
print "\tGetting Gene ID's from GI's...\n";
## iterate through the LinkSet objects
while (my $ds = $factory->next_LinkSet) {
# print "\t Link name: ",$ds->get_link_name,"\n";
# print "\tProtein IDs (GI): ",join(', ',$ds->get_submitted_ids),"\n";
# print "\t Gene IDs: ",join(', ',$ds->get_ids),"\n";
## get gene ID's
push (@gene_ids, $ds->get_ids);
}
print "\n\t".scalar(@gene_ids)." gene ID's found\n\n\tMapping details:\n\tacc <-> gi <-> gene_id\n";
for my $i ( 0 .. $#accessions ) {
print "\t".$accessions[$i]." <-> ".$gis[$i]." <-> ".$gene_ids[$i]."\n";
}
my $eutil = Bio::DB::EUtilities->new(-eutil => 'esummary',
-email => 'email@email.com',
-db => 'gene',
-id => \@gene_ids,
-retmax => 500 );
my $fetcher = Bio::DB::EUtilities->new(-eutil => 'efetch',
-email => 'email@email.com',
-db => 'nucleotide',
-rettype => 'gb',
-retmax => 500 );
print "\n\tGetting sequence data...\n\n";
my @check;
while (my $docsum = $eutil->next_DocSum) {
## to ensure we grab the right ChrStart information, we grab the Item above
## it in the Item hierarchy (visible via print_all from the eutil instance)
my ($item) = $docsum->get_Items_by_name('GenomicInfoType');
my %item_data = map {$_ => 0} qw(ChrAccVer ChrStart ChrStop);
while (my $sub_item = $item->next_subItem) {
if (exists $item_data{$sub_item->get_name}) {
$item_data{$sub_item->get_name} = $sub_item->get_content;
}
}
## check to make sure everything is set
for my $check (qw(ChrAccVer ChrStart ChrStop)) {
die "\t$check not set" unless $item_data{$check};
}
my $strand = $item_data{ChrStart} > $item_data{ChrStop} ? 2 : 1;
printf("\tRetrieving %s, from %d-%d, strand %d\n", $item_data{ChrAccVer},
$item_data{ChrStart},
$item_data{ChrStop},
$strand
);
my $length = $item_data{ChrStart} - $item_data{ChrStop};
if ( $length < 0 ) {
$length = $length * -1;
}
print "\tLength of seq: ".$length." bp\n";
$fetcher->set_parameters(-id => $item_data{ChrAccVer},
-seq_start => $item_data{ChrStart} + 1,
-seq_stop => $item_data{ChrStop} + 1,
-strand => $strand);
## dump into a temp genbank file
my $tmp_file = 'tmp.gb';
$fetcher->get_Response( -file => $tmp_file );
my $seqin = Bio::SeqIO->new(-file => $tmp_file, -format => 'genbank');
my ($organism, $gene_product);
## get some cross-link information for each seq, and incorporate it into display_id
while ( my $seq = $seqin->next_seq ) {
my %display_id;
for my $feat_object ($seq->get_SeqFeatures) {
## get length of current SeqFeature
my $length2 = ($feat_object->end) - ($feat_object->start);
## IMPORTANT ##
## because of overlapping CDS in bacterial genomes,
## lots of GenBank files have >1 CDS in them, and more than one protein
## product etc. By matching up the length of the retrieved seq with the
## expected length found within the gb file, we ensure that the correct
## information is retrieved and passed to $name
if ($length2 == $length) {
for my $tag ($feat_object->get_all_tags) {
for my $value ($feat_object->get_tag_values($tag)) {
## get locus_tag, protein_id, $organism and $gene_product
if ($tag =~ m/^locus_tag/) {
$display_id{$value}=();
} elsif ($tag =~ m/^protein_id/) {
$display_id{$value}=();
} elsif ($tag =~ m/organism/) {
$organism = $value;
} elsif ($tag =~ m/product/) {
$gene_product = $value;
}
## get both the GI and GeneID
if (($value =~ m/^GI:/) or ($value =~ m/^GeneID:/)) {
$display_id{$value}=();
}
}
}
}
}
$gene_product =~ s/\s+/_/g;
$organism =~ s/\s+/_/g;
$organism =~ s/\.//g;
## make the seq name out of info gathered from above
my $name = join ('|', sort {$b cmp $a} keys %display_id)."|".$gene_product."|".$organism;
foreach (sort {$b cmp $a} keys %display_id) {
print "\t\t".$_."\n";
}
## make new seqobj and print it to file
print "\tPrinting DNA seq for ".$name."\n\t~~~\n";
my $seqobj = Bio::Seq->new( -display_id => $name, -seq => $seq->seq() );
$seqio_out->write_seq($seqobj);
## check if input == output
my @a = keys %display_id;
push @check, $a[1];
}
## remove tmp file
unlink ($tmp_file);
}
## quick check if # inputs == # outputs
if ( scalar(@accessions) != scalar(@check) ) {
print "\n\tWARNING: number of output seqs != number of input seqs (check -retmax flag perhaps?)\n\tThere were ".(@check)." seqs written to file\n\tThese ones not in output:\n";
my $lc = List::Compare->new(\@accessions, \@check);
my @missing = $lc->get_unique();
foreach (@missing) { print "\t".$_."\n" };
} else {
print "\n\tThere were ".(@check)." seqs written to file - A-OK!\n";
}
print "\n\tFinished doing it, so it is\n\n";
Any comments or constructive criticisms are most welcome, as they will no doubt make my Perling better (it's pretty basic as you can see). It seems to works for me, clearly it might not work in all cases.
Thanks again for your time,
Reuben
Tip: EUtils requires only something that looks like an email address, as opposed to your real one :)