I want to be able to retrieve all refseq fasta sequences for a list of organisms. I am using the code below from the EUtilities cookbook which works for a single organism name.
use Bio::Perl;
use Bio::DB::EUtilities;
use Bio::DB::RefSeq;
# set optional history queue
my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch',
-email => 'me@foo.com',
-db => 'nuccore',
-term => 'Prevotella[ORGN] AND srcdb_refseq[PROP]',
-usehistory => 'y');
my $count = $factory->get_count;
# get history from queue
my $hist = $factory->next_History || die 'No history data returned';
print "History returned\n";
# note db carries over from above
$factory->set_parameters(-eutil => 'efetch',
-rettype => 'fasta',
-history => $hist);
my $retry = 0;
my ($retmax, $retstart) = (500,0);
open (my $out, '>', 'test_list_seqs.fa') || die "Can't open file:$!";
RETRIEVE_SEQS:
while ($retstart < $count) {
$factory->set_parameters(-retmax => $retmax,
-retstart => $retstart);
eval{
$factory->get_Response(-cb => sub {my ($data) = @_; print $out $data} );
};
if ($@) {
die "Server error: $@. Try again later" if $retry == 5;
print STDERR "Server error, redo #$retry\n";
$retry++ && redo RETRIEVE_SEQS;
}
print "Retrieved $retstart";
$retstart += $retmax;
}
close $out;
` I have attempted to read an organism name list from a file and retrieve sequences that way but that is giving me what I think may be the entire genbank fasta sequences from any database and any organism. Has anyone done a bulk retrieval before? Here is my attempt to get sequences from an organism name file.:
#!/usr/bin/perl
use strict;
use warnings;
use Bio::Perl;
use Bio::DB::EUtilities;
use Bio::DB::RefSeq;
open(INFILE,"<C:/Users/workspace/general/test_organism_name.txt") or die "Cannot open infile: $!";
while(<INFILE>){
chomp;
my $name = $_;
my $term = '"$name",[ORGN] AND srcdb_refseq[PROP]';
# set optional history queue
my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch',
-email => 'me@foo.com',
-db => 'nuccore',
-term => '$term',
-usehistory => 'y');
my $count = $factory->get_count;
# get history from queue
my $hist = $factory->next_History || die 'No history data returned';
print "History returned\n";
# note db carries over from above
$factory->set_parameters(-eutil => 'efetch',
-rettype => 'fasta',
-history => $hist);
my $retry = 0;
my ($retmax, $retstart) = (500,0);
open (my $out, '>', 'test_list_seqs.fa') || die "Can't open file:$!";
RETRIEVE_SEQS:
while ($retstart < $count) {
$factory->set_parameters(-retmax => $retmax,
-retstart => $retstart);
eval{
$factory->get_Response(-cb => sub {my ($data) = @_; print $out $data} );
};
if ($@) {
die "Server error: $@. Try again later" if $retry == 5;
print STDERR "Server error, redo #$retry\n";
$retry++ && redo RETRIEVE_SEQS;
}
print "Retrieved $retstart";
$retstart += $retmax;
}
close $out;
}
close INFILE;
Thanks!
it has been a while since I last programmed in Perl but it looks to me that you have a comma as well double quotes in this
"$name",[ORGN] AND srcdb_refseq[PROP]
Istvan, you are correct, I don't need the double quotes and the comma. I now have this line as: "
Has anyone gotten the DNA sequences by organism name for a list of several organism names using a script? I am willing to try other environments.
make sure to print and test your query as an EUtil url as well. You can do this in a much simpler way, the perl script that you have is needlessly complicated.