Hello there,
I am trying to retrieve all sequences for a particular taxon from NCBI. I found a script in an NCBI guide to the Entrez E-utilities: http://www.ncbi.nlm.nih.gov/books/NBK25498/ (application #3). The script is pasted below.
When I run the script, I expect to get more than 50,000 chimp sequences, but I don't. In fact, each run returns a different number of sequences. The same thing happens when I try to get other taxa. Any ideas on why this script returns a different number each time? Is there another way to retrieve all sequences for a particular group?
I would appreciate any help anyone can offer. Thank you!
use LWP::Simple;
$query = 'chimpanzee[orgn]+AND+biomol+mrna[prop]';
#assemble the esearch URL
$base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
$url = $base . "esearch.fcgi?db=nucleotide&term=$query&usehistory=y";
#post the esearch URL
$output = get($url);
#parse WebEnv, QueryKey and Count (# records retrieved)
$web = $1 if ($output =~ /<WebEnv>(\S+)<\/WebEnv>/);
$key = $1 if ($output =~ /<QueryKey>(\d+)<\/QueryKey>/);
$count = $1 if ($output =~ /<Count>(\d+)<\/Count>/);
#open output file for writing
open(OUT, ">chimp.fna") || die "Can't open file!\n";
#retrieve data in batches of 500
$retmax = 500;
for ($retstart = 0; $retstart < $count; $retstart += $retmax) {
    $efetch_url = $base ."efetch.fcgi?db=nucleotide&WebEnv=$web";
    $efetch_url .= "&query_key=$key&retstart=$retstart";
    $efetch_url .= "&retmax=$retmax&rettype=fasta&retmode=text";
    $efetch_out = get($efetch_url);
    print OUT "$efetch_out";
}
close OUT;
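
One possible explanation, which is my assumption and not something the NCBI guide confirms: get() from LWP::Simple returns undef when a request fails, and the loop above then writes nothing for that batch without any error. A few dropped batches would make the total vary from run to run. Below is a minimal sketch for checking this. The helper names fetch_with_retry and count_fasta_records are hypothetical and are not part of the NCBI script; the retry count and delay are arbitrary choices.

```perl
use strict;
use warnings;

# Hypothetical helper: call the code ref $fetch up to $max_tries times and
# return the first defined, non-empty result. Returns undef if every
# attempt fails. $delay is the pause in seconds between attempts.
sub fetch_with_retry {
    my ($fetch, $max_tries, $delay) = @_;
    for my $try (1 .. $max_tries) {
        my $out = $fetch->();
        return $out if defined $out && length $out;
        sleep $delay if $delay && $try < $max_tries;
    }
    return;
}

# Hypothetical helper: count FASTA records in a string by counting the
# lines that begin with '>'.
sub count_fasta_records {
    my ($text) = @_;
    my $n = () = $text =~ /^>/mg;
    return $n;
}
```

In the loop, the fetch line could then become $efetch_out = fetch_with_retry(sub { get($efetch_url) }, 3, 1); followed by a die if the result is undefined. Comparing count_fasta_records on the contents of chimp.fna against $count would show whether any batches were lost.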