How Can I Bulk Retrieve Refseq Fasta Files Based On Organism Name Using Eutilities?
2
0
Entering edit mode
12.2 years ago
random26078 ▴ 20

I want to be able to retrieve all refseq fasta sequences for a list of organisms. I am using the code below from the EUtilities cookbook which works for a single organism name.

use Bio::Perl;
use Bio::DB::EUtilities;
use Bio::DB::RefSeq;

# set optional history queue
my $factory = Bio::DB::EUtilities->new(-eutil      => 'esearch',
                                       -email      => 'me@foo.com',
                                       -db         => 'nuccore',
                                       -term       => 'Prevotella[ORGN] AND srcdb_refseq[PROP]',
                                       -usehistory => 'y');

my $count = $factory->get_count;
# get history from queue
my $hist  = $factory->next_History || die 'No history data returned';
print "History returned\n";
# note db carries over from above
$factory->set_parameters(-eutil   => 'efetch',
                         -rettype => 'fasta',
                         -history => $hist);

my $retry = 0;
my ($retmax, $retstart) = (500,0);

open (my $out, '>', 'test_list_seqs.fa') || die "Can't open file:$!";

RETRIEVE_SEQS:
while ($retstart < $count) {
    $factory->set_parameters(-retmax   => $retmax,
                             -retstart => $retstart);
    eval{
        $factory->get_Response(-cb => sub {my ($data) = @_; print $out $data} );
    };
    if ($@) {
        die "Server error: $@.  Try again later" if $retry == 5;
        print STDERR "Server error, redo #$retry\n";
        $retry++ && redo RETRIEVE_SEQS;
    }
    print "Retrieved $retstart";
    $retstart += $retmax;
}
close $out;

` I have attempted to read an organism name list from a file and retrieve sequences that way but that is giving me what I think may be the entire genbank fasta sequences from any database and any organism. Has anyone done a bulk retrieval before? Here is my attempt to get sequences from an organism name file.:

 #!/usr/bin/perl
use strict;
use warnings;
use Bio::Perl;
use Bio::DB::EUtilities;
use Bio::DB::RefSeq;

open(INFILE,"<C:/Users/workspace/general/test_organism_name.txt") or die "Cannot open infile: $!";

while(<INFILE>){
    chomp;
    my $name = $_;
                 my $term = '"$name",[ORGN] AND srcdb_refseq[PROP]';

# set optional history queue
my $factory = Bio::DB::EUtilities->new(-eutil      => 'esearch',
                                       -email      => 'me@foo.com',
                                       -db         => 'nuccore',
                                       -term       =>  '$term',
                                       -usehistory => 'y');

my $count = $factory->get_count;
# get history from queue
my $hist  = $factory->next_History || die 'No history data returned';
print "History returned\n";
# note db carries over from above
$factory->set_parameters(-eutil   => 'efetch',
                         -rettype => 'fasta',
                         -history => $hist);

my $retry = 0;
my ($retmax, $retstart) = (500,0);

open (my $out, '>', 'test_list_seqs.fa') || die "Can't open file:$!";

RETRIEVE_SEQS:
while ($retstart < $count) {
    $factory->set_parameters(-retmax   => $retmax,
                             -retstart => $retstart);
    eval{
        $factory->get_Response(-cb => sub {my ($data) = @_; print $out $data} );
    };
    if ($@) {
        die "Server error: $@.  Try again later" if $retry == 5;
        print STDERR "Server error, redo #$retry\n";
        $retry++ && redo RETRIEVE_SEQS;
    }
    print "Retrieved $retstart";
    $retstart += $retmax;
}
close $out;
}
close INFILE;

Thanks!

bioperl refseq • 4.6k views
ADD COMMENT
0
Entering edit mode

it has been a while since I last programmed in Perl but it looks to me that you have a comma as well double quotes in this "$name",[ORGN] AND srcdb_refseq[PROP]

ADD REPLY
0
Entering edit mode

Istvan, you are correct, I don't need the double quotes and the comma. I now have this line as: "

$name+AND+srcdb_refseq[PROP]"  but this now reports an error:  
MSG: Response Error
Server closed connection without sending any data back
STACK Bio::DB::GenericWebAgent::get_Response C:/Perl64/site/lib/Bio/DB/GenericWebAgent.pm:215
STACK Bio::DB::EUtilities::get_Parser C:/Perl64/site/lib/Bio/DB/EUtilities.pm:222
STACK Bio::DB::EUtilities::get_count C:/Perl64/site/lib/Bio/DB/EUtilities.pm:580

Has anyone gotten the DNA sequences by organism name for a list of several organism names using a script? I am willing to try other environments.

ADD REPLY
0
Entering edit mode

make sure to print and test your query as an EUtil url as well. You can do this in a much simpler way, the perl script that you have is needlessly complicated.

ADD REPLY
1
Entering edit mode
12.2 years ago
random26078 ▴ 20

Istvan's comment that the Eutilities example script I was trying to fit to my needs was too complicated made me look at the Eutilties guide here: NCBI Eutilities Quick Start Guide Which led me in a new direction that allowed me to solve my problem. I am posting the solution below in case anyone is interested. My organism name sequences file contains one organism name per line.

#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;

open (my $OUT, '>', '/home/Test/test_organism_seqs.fa') || die "Can't open file:$!";

open(INFILE,"</home>){
    chomp;
    my $name = $_;
    $name.="[ORGN]";

my $db = 'nuccore';
my $query = "$name+AND+srcdb_refseq[PROP]";

#base URL
my $base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
my $url= $base . "esearch.fcgi?db=$db&term=$query&usehistory=y";

#Run the search using the URL created above
my $output = get($url);

#Web Environment. This parameter specifies the Web Environment that 
#contains the UID list to be provided as input to ESummary. Usually 
#this WebEnv value is obtained from the output of a previous ESearch, 
#EPost or ELink call. The WebEnv parameter must be used in 
#conjunction with query_key.
my $web = $1 if ($output =~ /<WebEnv>(\S+)<\/WebEnv>/);

#Query key. This integer specifies which of the UID lists attached to the given 
#Web Environment will be used as input to ESummary.  Query keys are obtained
# from the output of previous ESearch, EPost or ELink calls.  The query_key 
#parameter must be used in conjunction with WebEnv.
my $key = $1 if ($output =~ /<QueryKey>(\d+)<\/QueryKey>/);

$url = $base . "esummary.fcgi?db=$db&query_key=$key&WebEnv=$web";

#Run the search using the esummary URL created above
my $docsums = get($url);

$url = $base . "efetch.fcgi?db=$db&query_key=$key&WebEnv=$web";
$url.= "&rettype=fasta&retmode=text";

#Run the search using the efetch URL created above.
my $data = get($url);
print $OUT "$data";
}

close $OUT;
exit;

Thanks for leading my thoughts to another solution!

ADD COMMENT
0
Entering edit mode

thanks for posting the solution - yours is much simpler than that from that guide. Goes to show that simple things work best.

ADD REPLY
0
Entering edit mode
12.2 years ago

I will add my comment to mark the question as answered:


It has been a while since I last programmed in Perl but it looks to me that you have a comma as well double quotes in this "$name",[ORGN] AND srcdb_refseq[PROP]

ADD COMMENT

Login before adding your answer.

Traffic: 1808 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6