Question

How Can I Bulk Retrieve Refseq Fasta Files Based On Organism Name Using Eutilities?

0

Entering edit mode

12.9 years ago

random26078 ▴ 20

I want to be able to retrieve all refseq fasta sequences for a list of organisms. I am using the code below from the EUtilities cookbook which works for a single organism name.

use Bio::Perl;
use Bio::DB::EUtilities;
use Bio::DB::RefSeq;

# set optional history queue
my $factory = Bio::DB::EUtilities->new(-eutil      => 'esearch',
                                       -email      => 'me@foo.com',
                                       -db         => 'nuccore',
                                       -term       => 'Prevotella[ORGN] AND srcdb_refseq[PROP]',
                                       -usehistory => 'y');

my $count = $factory->get_count;
# get history from queue
my $hist  = $factory->next_History || die 'No history data returned';
print "History returned\n";
# note db carries over from above
$factory->set_parameters(-eutil   => 'efetch',
                         -rettype => 'fasta',
                         -history => $hist);

my $retry = 0;
my ($retmax, $retstart) = (500,0);

open (my $out, '>', 'test_list_seqs.fa') || die "Can't open file:$!";

RETRIEVE_SEQS:
while ($retstart < $count) {
    $factory->set_parameters(-retmax   => $retmax,
                             -retstart => $retstart);
    eval{
        $factory->get_Response(-cb => sub {my ($data) = @_; print $out $data} );
    };
    if ($@) {
        die "Server error: $@.  Try again later" if $retry == 5;
        print STDERR "Server error, redo #$retry\n";
        $retry++ && redo RETRIEVE_SEQS;
    }
    print "Retrieved $retstart";
    $retstart += $retmax;
}
close $out;

` I have attempted to read an organism name list from a file and retrieve sequences that way but that is giving me what I think may be the entire genbank fasta sequences from any database and any organism. Has anyone done a bulk retrieval before? Here is my attempt to get sequences from an organism name file.:

 #!/usr/bin/perl
use strict;
use warnings;
use Bio::Perl;
use Bio::DB::EUtilities;
use Bio::DB::RefSeq;

open(INFILE,"<C:/Users/workspace/general/test_organism_name.txt") or die "Cannot open infile: $!";

while(<INFILE>){
    chomp;
    my $name = $_;
                 my $term = '"$name",[ORGN] AND srcdb_refseq[PROP]';

# set optional history queue
my $factory = Bio::DB::EUtilities->new(-eutil      => 'esearch',
                                       -email      => 'me@foo.com',
                                       -db         => 'nuccore',
                                       -term       =>  '$term',
                                       -usehistory => 'y');

my $count = $factory->get_count;
# get history from queue
my $hist  = $factory->next_History || die 'No history data returned';
print "History returned\n";
# note db carries over from above
$factory->set_parameters(-eutil   => 'efetch',
                         -rettype => 'fasta',
                         -history => $hist);

my $retry = 0;
my ($retmax, $retstart) = (500,0);

open (my $out, '>', 'test_list_seqs.fa') || die "Can't open file:$!";

RETRIEVE_SEQS:
while ($retstart < $count) {
    $factory->set_parameters(-retmax   => $retmax,
                             -retstart => $retstart);
    eval{
        $factory->get_Response(-cb => sub {my ($data) = @_; print $out $data} );
    };
    if ($@) {
        die "Server error: $@.  Try again later" if $retry == 5;
        print STDERR "Server error, redo #$retry\n";
        $retry++ && redo RETRIEVE_SEQS;
    }
    print "Retrieved $retstart";
    $retstart += $retmax;
}
close $out;
}
close INFILE;

Thanks!

bioperl refseq • 4.8k views

ADD COMMENT • link 12.9 years ago by random26078 ▴ 20

0

Entering edit mode

it has been a while since I last programmed in Perl but it looks to me that you have a comma as well double quotes in this "$name",[ORGN] AND srcdb_refseq[PROP]

ADD REPLY • link 12.9 years ago by Istvan Albert 102k

0

Entering edit mode

Istvan, you are correct, I don't need the double quotes and the comma. I now have this line as: "

$name+AND+srcdb_refseq[PROP]"  but this now reports an error:  
MSG: Response Error
Server closed connection without sending any data back
STACK Bio::DB::GenericWebAgent::get_Response C:/Perl64/site/lib/Bio/DB/GenericWebAgent.pm:215
STACK Bio::DB::EUtilities::get_Parser C:/Perl64/site/lib/Bio/DB/EUtilities.pm:222
STACK Bio::DB::EUtilities::get_count C:/Perl64/site/lib/Bio/DB/EUtilities.pm:580

Has anyone gotten the DNA sequences by organism name for a list of several organism names using a script? I am willing to try other environments.

ADD REPLY • link updated 12.9 years ago by Istvan Albert 102k • written 12.9 years ago by random26078 ▴ 20

0

Entering edit mode

make sure to print and test your query as an EUtil url as well. You can do this in a much simpler way, the perl script that you have is needlessly complicated.

ADD REPLY • link 12.9 years ago by Istvan Albert 102k

score 1 · Answer 1 · 2012-09-14

Istvan's comment that the Eutilities example script I was trying to fit to my needs was too complicated made me look at the Eutilties guide here: NCBI Eutilities Quick Start Guide Which led me in a new direction that allowed me to solve my problem. I am posting the solution below in case anyone is interested. My organism name sequences file contains one organism name per line.

#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;

open (my $OUT, '>', '/home/Test/test_organism_seqs.fa') || die "Can't open file:$!";

open(INFILE,"</home>){
    chomp;
    my $name = $_;
    $name.="[ORGN]";

my $db = 'nuccore';
my $query = "$name+AND+srcdb_refseq[PROP]";

#base URL
my $base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
my $url= $base . "esearch.fcgi?db=$db&term=$query&usehistory=y";

#Run the search using the URL created above
my $output = get($url);

#Web Environment. This parameter specifies the Web Environment that 
#contains the UID list to be provided as input to ESummary. Usually 
#this WebEnv value is obtained from the output of a previous ESearch, 
#EPost or ELink call. The WebEnv parameter must be used in 
#conjunction with query_key.
my $web = $1 if ($output =~ /<WebEnv>(\S+)<\/WebEnv>/);

#Query key. This integer specifies which of the UID lists attached to the given 
#Web Environment will be used as input to ESummary.  Query keys are obtained
# from the output of previous ESearch, EPost or ELink calls.  The query_key 
#parameter must be used in conjunction with WebEnv.
my $key = $1 if ($output =~ /<QueryKey>(\d+)<\/QueryKey>/);

$url = $base . "esummary.fcgi?db=$db&query_key=$key&WebEnv=$web";

#Run the search using the esummary URL created above
my $docsums = get($url);

$url = $base . "efetch.fcgi?db=$db&query_key=$key&WebEnv=$web";
$url.= "&rettype=fasta&retmode=text";

#Run the search using the efetch URL created above.
my $data = get($url);
print $OUT "$data";
}

close $OUT;
exit;

Thanks for leading my thoughts to another solution!

score 0 · Answer 2 · 2012-09-13

0

Entering edit mode

12.9 years ago

Istvan Albert 102k

I will add my comment to mark the question as answered:

It has been a while since I last programmed in Perl but it looks to me that you have a comma as well double quotes in this "$name",[ORGN] AND srcdb_refseq[PROP]

ADD COMMENT • link 12.9 years ago by Istvan Albert 102k