Question

Perl Script For Retrieval Of Gene Sequence (In Fasta Format) From Ncbi

1

Entering edit mode

13.9 years ago

Vars22 ▴ 20

Hello everybody,

My question is how I can retrieve a gene sequence by providing the geneID as user input from NCBI and then store it in an Excel sheet.

thank you,

regards: vars22

sequence retrieval fasta perl • 14k views

ADD COMMENT • link updated 11.3 years ago by Biostar 20 • written 13.9 years ago by Vars22 ▴ 20

9

Entering edit mode

Something does not make sense to me here. If you want to store the sequence in an Excel sheet (which I think is a bad idea to start with), why would you want the sequence to be in FASTA format?

ADD REPLY • link 13.9 years ago by Lars Juhl Jensen 11k

6

Entering edit mode

Answer to second part of question: you almost certainly do not want to store sequence data in Excel. You can prove me wrong by explaining the sequence analysis functionality in Excel that you plan to use.

ADD REPLY • link 13.9 years ago by Neilfws 49k

3

Entering edit mode

what did you try so far ?

ADD REPLY • link 13.9 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Additionally, if the goal is to retrieve sequences based on a certain user input, why is it crucial that this is done in Perl?

ADD REPLY • link 13.9 years ago by Lars Juhl Jensen 11k

0

Entering edit mode

Might be an idea to output to .csv? Excel can still read it then and easier to parse programmatically?

ADD REPLY • link 13.9 years ago by Steve Moss 2.3k

0

Entering edit mode

i only want the sequences if it will be in fasta format i can easily extract the sequences.i have geneIDs with me of which i need sequences. i have used a module lwp and set the path of ncbi but i the result always shows connection timed out.

ADD REPLY • link 13.9 years ago by Vars22 ▴ 20

0

Entering edit mode

thanks for the help i m grateful

ADD REPLY • link 13.9 years ago by Vars22 ▴ 20

0

Entering edit mode

plz suggest me something in perl not bioperl. the bioperl link you have sent me i have already been there.thanks for the other links.

ADD REPLY • link 13.9 years ago by Vars22 ▴ 20

Ram · Answer 1 · 2011-02-17

10

Entering edit mode

13.9 years ago

Peter 6.0k

For the first part, use the NCBI Entrez interface (perhaps via BioPerl since you want to use Perl).

For the second part about Excel, As others have said too, my answer is don't do it. For one set of reasons, see: Zeeberg et al (2004) Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics.

ADD COMMENT • link updated 5.3 years ago by Ram 44k • written 13.9 years ago by Peter 6.0k

2

Entering edit mode

++ for the reference :)

ADD REPLY • link 13.9 years ago by Simon Cockell 7.4k

score 4 · Answer 2 · 2011-02-17

Take a look at these links?

http://www.bioperl.org/wiki/HOWTO:Getting_Genomic_Sequences

http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook#esummary_-.3E_efetch

Outputting to .csv would be better, if you really want to use Excel to view these. I have used Bio::DB::Fasta and Text::CSV_XS for similar stuff in the past!

You could get the sequence using Bio::DB::EUtilities as described here. I would build a Fasta sequence object from that using Bio::Seq before writing it to a Fasta file using Bio::SeqIO.

If you are determined to look at Excel then these links might be useful?

http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/SeqIO/excel.html

http://www.ibm.com/developerworks/linux/library/l-pexcel/

Also, some stuff around page 567 in Bioinformatics: sequence and genome analysis (2nd Ed)

score 0 · Answer 3 · 2011-02-17

0

Entering edit mode

13.9 years ago

Julien ▴ 160

NCBI provides a script-generation website at:

http://www.ncbi.nlm.nih.gov/Class/PowerTools/eutils/ebot/ebot.cgi

Follow the directions on the screen, and it will create a perl script for you. It's verbose and comes with a lot of disclaimers, but it is very easy to modify to your needs.

ADD COMMENT • link 13.9 years ago by Julien ▴ 160

Ram · Answer 4 · 2011-02-25

The code for retrieving gene sequence using gene IDs

#!/usr/bin/perl
use Bio::DB::EUtilities;
use Bio::SeqIO;

system(cls);

print "Enter the input file name: "; #ask user for the file name containing the gene IDs
$filename=[?];                   #take the file name as input
open(INPUT,$filename) or die "Cannot open input file";
@arrayids=[?];                   #store the Gene IDs in an array
close INPUT;

foreach $ea(@arrayids)
{
    chomp($ea);
}

$len=scalar(@arrayids);

for($i=0;$i<$len;$i++)
{
    $fi=$arrayids[$i];

    $file_nam[$i]=join('',$fi,".gbe"); #joins the genbank extension with the gene ID

    open(FILE,">$file_nam[$i]") or die "cannot open $file_nam....";

    my @ids=$arrayids[$i];
    my $eutil   = Bio::DB::EUtilities->new(-eutil => 'esummary',
                                           -email => 'user@xyz.com',
                                           -db    => 'gene',
                                           -id    => \@ids);

    my $fetcher = Bio::DB::EUtilities->new(-eutil   => 'efetch',
                                           -db      => 'nucleotide',
                                           -rettype => 'fasta');#returns fasta #sequence you can change the format acc to ur needs

    while (my $docsum = $eutil->next_DocSum) {
        my ($item) = $docsum->get_Items_by_name('GenomicInfoType');

        my %item_data = map {$_ => 0} qw(ChrAccVer ChrStart ChrStop);

        while (my $sub_item = $item->next_subItem) {
            if (exists $item_data{$sub_item->get_name}) {
                $item_data{$sub_item->get_name} = $sub_item->get_content;
            }
        }

        for my $check (qw(ChrAccVer ChrStart ChrStop)) {
            die "$check not set" unless $item_data{$check};
        }

        my $strand = $item_data{ChrStart} > $item_data{ChrStop} ? 2 : 1;
        printf("Retrieving %s, from %d-%d, strand %d\n", $item_data{ChrAccVer},
                                                        $item_data{ChrStart},
                                                        $item_data{ChrStop},
                                                        $strand
                                                        );

        $fetcher->set_parameters(-id => $item_data{ChrAccVer},
                                 -seq_start => $item_data{ChrStart} + 1,
                                 -seq_stop  => $item_data{ChrStop} + 1,
                                 -strand    => $strand);
        print FILE $fetcher->get_Response->content;
    }
    close FILE;
}

for($i=0;$i<$len;$i++)
{
    $fii=$arrayids[$i];
    $old_file_nam[$i]=join('',$fii,".gbe"); 
    $new_file_nam[$i]=join('',$fii,".fasta");
    $in  = Bio::SeqIO->new(-file => "$file_nam[$i]" , '-format' => 'GenBank');
    $out = Bio::SeqIO->new(-file => ">$new_file_nam[$i]" , '-format' => 'Fasta');

    while ( my $seq = $in->next_seq() )
    {
            $out->write_seq($seq);
    }
}

print "the new file names :n@new_file_nam";

for($i=0;$i<$len;$i++)
{
    $temfi=$arrayids[$i];
    $file_del=join('',$temfi,".gbe");
    system("del $file_del");
}