Thanks very much for your help!
I eventually worked it out, using eSearch followed by eLink followed by eFetch to get the corresponding CDS (as both posters suggested above).
My think my confusion stemmed primarily from not really understanding what exactly GI's, gene ID's, accessions, UID's etc etc were, in terms of what was appropriate to use as an initial query. For example, I couldn't understand why (in my initial question) using an accession as a query and setting '-dbfrom => protein' and 'db => gene' didn't get me what I needed. But I now understand that both the protein seq and its corresponding CDS have unique GI numbers, even though they refer to the same biological entity (you know what I mean!).
Anyway, I cobbled together the following from scraps of the EUtilites Cookbook and it seems to do what I want, in case it's ever of use for someone else. It takes a multi fasta file of proteins with accessions (ie >YP_273711.1 as fasta header) and returns a multi fasta of corresponding CDS's, where the fasta header is made up of 4 bits of info:
locusname | proteinaccession | protienGI# | geneid#
i.e. hopefully everything I'll ever need!
Here it is:
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO;
use Bio::DB::EUtilities;
use List::Compare;
print "\n\t#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#\n\t# eutils_DNAFromProteinAccession.pl #\n\t#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#\n\n";
my $seqio_in = Bio::SeqIO->new( -file => $ARGV[0], -format => 'fasta' );
my $outfile = $ARGV[0];
$outfile =~ s/.fasta/_DNA.fasta/;
my $seqio_out = Bio::SeqIO->new( -file => ">".$outfile, -format => 'fasta' );
my @accessions;
while ( my $seq = $seqio_in->next_seq() ) {
if ( $seq->display_id() =~ m/\|/) {
my $header = $seq->display_id();
my @a = split (/\|/, $header);
push ( @accessions, $a[1] );
} else {
push ( @accessions, $seq->display_id() );
}
}
print "\tThere are ".@accessions." sequences in input file: '".$ARGV[0]."'\n\n";
print "\tGetting GI's from accessions... ";
my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch',
-email => 'email@email.com',
-db => 'protein',
-term => join(',',@accessions),
-retmax => 500 );
my @gis = $factory->get_ids;
print scalar(@gis)." found\n";
$factory->reset_parameters(-eutil => 'elink',
-email => 'email@email.com',
-db => 'gene',
-dbfrom => 'protein',
-id => \@gis,
-retmax => 500 );
my @gene_ids;
print "\tGetting Gene ID's from GI's...\n";
while (my $ds = $factory->next_LinkSet) {
push (@gene_ids, $ds->get_ids);
}
print "\n\t".scalar(@gene_ids)." gene ID's found\n\n\tMapping details:\n\tacc <-> gi <-> gene_id\n";
for my $i ( 0 .. $
print "\t".$accessions[$i]." <-> ".$gis[$i]." <-> ".$gene_ids[$i]."\n";
}
my $eutil = Bio::DB::EUtilities->new(-eutil => 'esummary',
-email => 'email@email.com',
-db => 'gene',
-id => \@gene_ids,
-retmax => 500 );
my $fetcher = Bio::DB::EUtilities->new(-eutil => 'efetch',
-email => 'email@email.com',
-db => 'nucleotide',
-rettype => 'gb',
-retmax => 500 );
print "\n\tGetting sequence data...\n\n";
my @check;
while (my $docsum = $eutil->next_DocSum) {
my ($item) = $docsum->get_Items_by_name('GenomicInfoType');
my %item_data = map {$_ => 0} qw(ChrAccVer ChrStart ChrStop);
while (my $sub_item = $item->next_subItem) {
if (exists $item_data{$sub_item->get_name}) {
$item_data{$sub_item->get_name} = $sub_item->get_content;
}
}
for my $check (qw(ChrAccVer ChrStart ChrStop)) {
die "\t$check not set" unless $item_data{$check};
}
my $strand = $item_data{ChrStart} > $item_data{ChrStop} ? 2 : 1;
printf("\tRetrieving %s, from %d-%d, strand %d\n", $item_data{ChrAccVer},
$item_data{ChrStart},
$item_data{ChrStop},
$strand
);
my $length = $item_data{ChrStart} - $item_data{ChrStop};
if ( $length < 0 ) {
$length = $length * -1;
}
print "\tLength of seq: ".$length." bp\n";
$fetcher->set_parameters(-id => $item_data{ChrAccVer},
-seq_start => $item_data{ChrStart} + 1,
-seq_stop => $item_data{ChrStop} + 1,
-strand => $strand);
my $tmp_file = 'tmp.gb';
$fetcher->get_Response( -file => $tmp_file );
my $seqin = Bio::SeqIO->new(-file => $tmp_file, -format => 'genbank');
my ($organism, $gene_product);
while ( my $seq = $seqin->next_seq ) {
my %display_id;
for my $feat_object ($seq->get_SeqFeatures) {
my $length2 = ($feat_object->end) - ($feat_object->start);
if ($length2 == $length) {
for my $tag ($feat_object->get_all_tags) {
for my $value ($feat_object->get_tag_values($tag)) {
if ($tag =~ m/^locus_tag/) {
$display_id{$value}=();
} elsif ($tag =~ m/^protein_id/) {
$display_id{$value}=();
} elsif ($tag =~ m/organism/) {
$organism = $value;
} elsif ($tag =~ m/product/) {
$gene_product = $value;
}
if (($value =~ m/^GI:/) or ($value =~ m/^GeneID:/)) {
$display_id{$value}=();
}
}
}
}
}
$gene_product =~ s/\s+/_/g;
$organism =~ s/\s+/_/g;
$organism =~ s/\.//g;
my $name = join ('|', sort {$b cmp $a} keys %display_id)."|".$gene_product."|".$organism;
foreach (sort {$b cmp $a} keys %display_id) {
print "\t\t".$_."\n";
}
print "\tPrinting DNA seq for ".$name."\n\t~~~\n";
my $seqobj = Bio::Seq->new( -display_id => $name, -seq => $seq->seq() );
$seqio_out->write_seq($seqobj);
my @a = keys %display_id;
push @check, $a[1];
}
unlink ($tmp_file);
}
if ( scalar(@accessions) != scalar(@check) ) {
print "\n\tWARNING: number of output seqs != number of input seqs (check -retmax flag perhaps?)\n\tThere were ".(@check)." seqs written to file\n\tThese ones not in output:\n";
my $lc = List::Compare->new(\@accessions, \@check);
my @missing = $lc->get_unique();
foreach (@missing) { print "\t".$_."\n" };
} else {
print "\n\tThere were ".(@check)." seqs written to file - A-OK!\n";
}
print "\n\tFinished doing it, so it is\n\n";
Any comments or constructive criticisms are most welcome, as they will no doubt make my Perling better (it's pretty basic as you can see). It seems to works for me, clearly it might not work in all cases.
Thanks again for your time,
Reuben
Tip: EUtils requires only something that looks like an email address, as opposed to your real one :)