How To Bulk Retrieve Dna Fasta Given Gene Uids?
2
0
Entering edit mode
10.9 years ago
random26078 ▴ 20

I have read the post here Program or script to go from partial protein sequence to DNA and read through the eutilities manual but I am unsure how to use the 'history' parameter with my data which I understand that I need to use because I want to retrieve a huge number of nucleotide sequences.

I have a file of > 800,000 UIDs with one gene UID per line. I want to retrieve the DNA fasta file from NCBI for each UID.

Here is what I have tried. This works on a short list but my list of over 800,000 UIDs is too big. I don't understand where to include the history parameters for my search and retrieval which I believe will allow me to download all of the fasta files I need. Can anyone tell me how I should incorporate this in my script?

#!/usr/bin/perl
use strict;
use warnings;
use Bio::Perl;
use Bio::DB::EUtilities;

# this needs to be a list of EntrezGene unique IDs
my @ids = qw(3974211 1343189 2657315 7871523 6394675 3289281 11593755 11702744 12745110 12932362);

#open my $UIDLIST, '<', 'E:/uid-list.txt';
open my $OUT, '>', 'E:/gene-uid-list.fasta';

my $eutil   = Bio::DB::EUtilities->new(-eutil => 'esummary',
                                       -email => 'ma@b.org',
                                   -db    => 'gene',
                                   -id    => \@ids);

my $fetcher = Bio::DB::EUtilities->new(-eutil   => 'efetch',
                                   -email => 'ma@b.org',
                                   -db      => 'nucleotide',
                                   -rettype => 'fasta');

while (my $docsum = $eutil->next_DocSum) {
    # to ensure we grab the right ChrStart information, we grab the Item above
    # it in the Item hierarchy (visible via print_all from the eutil instance).
    my ($item) = $docsum->get_Items_by_name('GenomicInfoType');
    my %item_data = map {$_ => 0} qw(ChrAccVer ChrStart ChrStop);

    while (my $sub_item = $item->next_subItem) {
        if (exists $item_data{$sub_item->get_name}) {
            $item_data{$sub_item->get_name} = $sub_item->get_content;
        }
    }
    # check to make sure everything is set
    for my $check (qw(ChrAccVer ChrStart ChrStop)) {
        die "$check not set" unless $item_data{$check};
    }

    my $strand = $item_data{ChrStart} > $item_data{ChrStop} ? 2 : 1;
    printf("Retrieving %s, from %d-%d, strand %d\n", $item_data{ChrAccVer},
                                                $item_data{ChrStart},
                                                $item_data{ChrStop},
                                                $strand
                                                );

    $fetcher->set_parameters(-id => $item_data{ChrAccVer},
                         -seq_start => $item_data{ChrStart} + 1,
                         -seq_stop  => $item_data{ChrStop} + 1,
                         -strand    => $strand);
    #print $OUT $fetcher->get_Response->content;
}
close $OUT; #close $UIDLIST; exit;
bioperl perl eutils gene • 4.0k views
ADD COMMENT
1
Entering edit mode

Have you read the Bioperl EUtilities Cookbook? In particular, the section "How do I retrieve a long list of sequences using a query?"

ADD REPLY
0
Entering edit mode

Yes, I have read that section. My problem with using a history is that all of my UIDs are in a file I have locally. I am not retrieving the UIDs through an efetch call.

ADD REPLY
0
Entering edit mode
10.9 years ago
random26078 ▴ 20

I made some adjustments to my script and have it working. My only problem now is that I keep getting Bad Gateway errors. My revised code is below. Does anyone have a better way to retrieve >800,000 DNA sequences from a list of UID gene ids?

#!/usr/bin/perl
use strict;
use warnings;
use Bio::Perl;
use Bio::DB::EUtilities;

#This script takes as input a list of UIDs, one per line
#and returns the DNA fasta sequence.

my @uid;

open my $OUT, '>', 'E:/gene-uid-list.fasta';
open my $UIDLIST, '<', 'E:/gene-uid-list.txt' or die "Could not open uid list file: $!\n";

while(<$UIDLIST>){
    my $line = $_;
    chomp $line;
    push @uid, $line;
    my $eutil   = Bio::DB::EUtilities->new(-eutil => 'esummary',
                                              -email => 'ma@b.org',
                                             -db    => 'gene',
                                             -id    => \@uid);

 my $fetcher = Bio::DB::EUtilities->new(-eutil   => 'efetch',
                                          -email => 'ma@b.org',
                                         -db      => 'nucleotide',
                                          -rettype => 'fasta');

while (my $docsum = $eutil->next_DocSum) {    # to ensure we grab the right ChrStart information, we grab the Item above it in the Item hierarchy (visible via print_all from the eutil instance).
    my ($item) = $docsum->get_Items_by_name('GenomicInfoType');
    my %item_data = map {$_ => 0} qw(ChrAccVer ChrStart ChrStop);    #creates a hash for each key (accession, start and stop) and sets their values to 0.

    while (my $sub_item = $item->next_subItem) {
        if (exists $item_data{$sub_item->get_name}) {
            $item_data{$sub_item->get_name} = $sub_item->get_content;    #gets the accession, chr start and chr stop coordinates and sets the value for the key.
        }
    }
# check to make sure everything is set
    for my $check (qw(ChrAccVer ChrStart ChrStop)) {
        die "$check not set" unless $item_data{$check};
    }

    my $strand = $item_data{ChrStart} > $item_data{ChrStop} ? 2 : 1;
    printf("Retrieving %s, from %d-%d, strand %d\n", $item_data{ChrAccVer},
                                                     $item_data{ChrStart},
                                                     $item_data{ChrStop},
                                                     $strand);

    $fetcher->set_parameters(-id => $item_data{ChrAccVer},
                             -seq_start => $item_data{ChrStart} + 1,
                              -seq_stop  => $item_data{ChrStop} + 1,
                              -strand    => $strand);
    print $OUT $fetcher->get_Response->content;
}
sleep 1;    #wait one second 
pop @uid;    #discard the current UID off of the array.
}

close $OUT;
close $UIDLIST;

exit;
ADD COMMENT
0
Entering edit mode

I think the idea is to send all of your IDs to the server and retrieve them in batches, but I may be misunderstanding (I'll try to find out if this is true). The code above is sending one ID at a time, so you would still be making >800,000 queries.

ADD REPLY
0
Entering edit mode
10.9 years ago
biostar ▴ 170

If you have stand-alone BLAST in your computer then you can do it easily using the following command..

blastdbcmd -db database_name -dbtype nucl/proc -entry_batch file_with_uid.txt
ADD COMMENT
0
Entering edit mode

Using the BLAST+ blastdbcmd is a good idea but won't work in my case. I am using protein (gene) UIDs and I need to get nucleotide FASTA sequences that those proteins come from.

ADD REPLY

Login before adding your answer.

Traffic: 1581 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6