Hello everybody,
My question is how I can retrieve a gene sequence by providing the geneID as user input from NCBI and then store it in an Excel sheet.
thank you,
regards: vars22
Hello everybody,
My question is how I can retrieve a gene sequence by providing the geneID as user input from NCBI and then store it in an Excel sheet.
thank you,
regards: vars22
For the first part, use the NCBI Entrez interface (perhaps via BioPerl since you want to use Perl).
For the second part about Excel, As others have said too, my answer is don't do it. For one set of reasons, see: Zeeberg et al (2004) Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics.
Take a look at these links?
http://www.bioperl.org/wiki/HOWTO:Getting_Genomic_Sequences
http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook#esummary_-.3E_efetch
Outputting to .csv would be better, if you really want to use Excel to view these. I have used Bio::DB::Fasta and Text::CSV_XS for similar stuff in the past!
You could get the sequence using Bio::DB::EUtilities as described here. I would build a Fasta sequence object from that using Bio::Seq before writing it to a Fasta file using Bio::SeqIO.
If you are determined to look at Excel then these links might be useful?
http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/SeqIO/excel.html
http://www.ibm.com/developerworks/linux/library/l-pexcel/
Also, some stuff around page 567 in Bioinformatics: sequence and genome analysis (2nd Ed)
NCBI provides a script-generation website at:
http://www.ncbi.nlm.nih.gov/Class/PowerTools/eutils/ebot/ebot.cgi
Follow the directions on the screen, and it will create a perl script for you. It's verbose and comes with a lot of disclaimers, but it is very easy to modify to your needs.
The code for retrieving gene sequence using gene IDs
#!/usr/bin/perl
use Bio::DB::EUtilities;
use Bio::SeqIO;
system(cls);
print "Enter the input file name: "; #ask user for the file name containing the gene IDs
$filename=[?]; #take the file name as input
open(INPUT,$filename) or die "Cannot open input file";
@arrayids=[?]; #store the Gene IDs in an array
close INPUT;
foreach $ea(@arrayids)
{
chomp($ea);
}
$len=scalar(@arrayids);
for($i=0;$i<$len;$i++)
{
$fi=$arrayids[$i];
$file_nam[$i]=join('',$fi,".gbe"); #joins the genbank extension with the gene ID
open(FILE,">$file_nam[$i]") or die "cannot open $file_nam....";
my @ids=$arrayids[$i];
my $eutil = Bio::DB::EUtilities->new(-eutil => 'esummary',
-email => 'user@xyz.com',
-db => 'gene',
-id => \@ids);
my $fetcher = Bio::DB::EUtilities->new(-eutil => 'efetch',
-db => 'nucleotide',
-rettype => 'fasta');#returns fasta #sequence you can change the format acc to ur needs
while (my $docsum = $eutil->next_DocSum) {
my ($item) = $docsum->get_Items_by_name('GenomicInfoType');
my %item_data = map {$_ => 0} qw(ChrAccVer ChrStart ChrStop);
while (my $sub_item = $item->next_subItem) {
if (exists $item_data{$sub_item->get_name}) {
$item_data{$sub_item->get_name} = $sub_item->get_content;
}
}
for my $check (qw(ChrAccVer ChrStart ChrStop)) {
die "$check not set" unless $item_data{$check};
}
my $strand = $item_data{ChrStart} > $item_data{ChrStop} ? 2 : 1;
printf("Retrieving %s, from %d-%d, strand %d\n", $item_data{ChrAccVer},
$item_data{ChrStart},
$item_data{ChrStop},
$strand
);
$fetcher->set_parameters(-id => $item_data{ChrAccVer},
-seq_start => $item_data{ChrStart} + 1,
-seq_stop => $item_data{ChrStop} + 1,
-strand => $strand);
print FILE $fetcher->get_Response->content;
}
close FILE;
}
for($i=0;$i<$len;$i++)
{
$fii=$arrayids[$i];
$old_file_nam[$i]=join('',$fii,".gbe");
$new_file_nam[$i]=join('',$fii,".fasta");
$in = Bio::SeqIO->new(-file => "$file_nam[$i]" , '-format' => 'GenBank');
$out = Bio::SeqIO->new(-file => ">$new_file_nam[$i]" , '-format' => 'Fasta');
while ( my $seq = $in->next_seq() )
{
$out->write_seq($seq);
}
}
print "the new file names :n@new_file_nam";
for($i=0;$i<$len;$i++)
{
$temfi=$arrayids[$i];
$file_del=join('',$temfi,".gbe");
system("del $file_del");
}
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Something does not make sense to me here. If you want to store the sequence in an Excel sheet (which I think is a bad idea to start with), why would you want the sequence to be in FASTA format?
Answer to second part of question: you almost certainly do not want to store sequence data in Excel. You can prove me wrong by explaining the sequence analysis functionality in Excel that you plan to use.
what did you try so far ?
Additionally, if the goal is to retrieve sequences based on a certain user input, why is it crucial that this is done in Perl?
Might be an idea to output to .csv? Excel can still read it then and easier to parse programmatically?
i only want the sequences if it will be in fasta format i can easily extract the sequences.i have geneIDs with me of which i need sequences. i have used a module lwp and set the path of ncbi but i the result always shows connection timed out.
thanks for the help i m grateful
plz suggest me something in perl not bioperl. the bioperl link you have sent me i have already been there.thanks for the other links.