I have read the post here Program or script to go from partial protein sequence to DNA and read through the eutilities manual but I am unsure how to use the 'history' parameter with my data which I understand that I need to use because I want to retrieve a huge number of nucleotide sequences.
I have a file of > 800,000 UIDs with one gene UID per line. I want to retrieve the DNA fasta file from NCBI for each UID.
Here is what I have tried. This works on a short list but my list of over 800,000 UIDs is too big. I don't understand where to include the history parameters for my search and retrieval which I believe will allow me to download all of the fasta files I need. Can anyone tell me how I should incorporate this in my script?
#!/usr/bin/perl
use strict;
use warnings;
use Bio::Perl;
use Bio::DB::EUtilities;
# this needs to be a list of EntrezGene unique IDs
my @ids = qw(3974211 1343189 2657315 7871523 6394675 3289281 11593755 11702744 12745110 12932362);
#open my $UIDLIST, '<', 'E:/uid-list.txt';
open my $OUT, '>', 'E:/gene-uid-list.fasta';
my $eutil = Bio::DB::EUtilities->new(-eutil => 'esummary',
-email => 'ma@b.org',
-db => 'gene',
-id => \@ids);
my $fetcher = Bio::DB::EUtilities->new(-eutil => 'efetch',
-email => 'ma@b.org',
-db => 'nucleotide',
-rettype => 'fasta');
while (my $docsum = $eutil->next_DocSum) {
# to ensure we grab the right ChrStart information, we grab the Item above
# it in the Item hierarchy (visible via print_all from the eutil instance).
my ($item) = $docsum->get_Items_by_name('GenomicInfoType');
my %item_data = map {$_ => 0} qw(ChrAccVer ChrStart ChrStop);
while (my $sub_item = $item->next_subItem) {
if (exists $item_data{$sub_item->get_name}) {
$item_data{$sub_item->get_name} = $sub_item->get_content;
}
}
# check to make sure everything is set
for my $check (qw(ChrAccVer ChrStart ChrStop)) {
die "$check not set" unless $item_data{$check};
}
my $strand = $item_data{ChrStart} > $item_data{ChrStop} ? 2 : 1;
printf("Retrieving %s, from %d-%d, strand %d\n", $item_data{ChrAccVer},
$item_data{ChrStart},
$item_data{ChrStop},
$strand
);
$fetcher->set_parameters(-id => $item_data{ChrAccVer},
-seq_start => $item_data{ChrStart} + 1,
-seq_stop => $item_data{ChrStop} + 1,
-strand => $strand);
#print $OUT $fetcher->get_Response->content;
}
close $OUT; #close $UIDLIST; exit;
Have you read the Bioperl EUtilities Cookbook? In particular, the section "How do I retrieve a long list of sequences using a query?"
Yes, I have read that section. My problem with using a history is that all of my UIDs are in a file I have locally. I am not retrieving the UIDs through an efetch call.