Hello everyone, I tried to find the answer in the previous question. I have a list of NCBI accessions (GenPept) of protein sequences, I'd like to retrieve a corresponding nucleic sequences returned in fasta file using bioperl, someone have ideas?
Hello everyone, I tried to find the answer in the previous question. I have a list of NCBI accessions (GenPept) of protein sequences, I'd like to retrieve a corresponding nucleic sequences returned in fasta file using bioperl, someone have ideas?
THIS FACILITY IS PROVIDED BY NCBI ITSELF..HOPE THIS WILL HELP
The simplest way is to use E-Utilities, which are already ready and written in perl. You can pass the same type of complex queries one can use in the web interface of Entrez (including ranges). The problem in your case is that you can't use accn of protein to get its gene/mrna directly. Given that you have GenPept, you could use it to retrieve the DBSOURCE locus/accession of the gene and feed it into Efetch.
I think you could genpept accn to retrieve genes in NCBI.
Edit:
This solution works for finished genomes with annotated genes and transcripts. For drafts the situation is much more delicate and depends heavily on the annotation quality.
By the end of the page, you'll see a section "Demonstration programs". Click there and save the script. There's a lot of useless comments inside. The actual perl part is pretty simple straightforward. http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/eutils_example.pl
Thanks again, I'm trying o use E-Uilities examples/scripts, can not work for my case, I want to get the corresponding CDS of the protein. For example I have the ID of the protein ID ZP_07126586.1 (link : ncbi.nlm.nih.gov/protein/ZP_07126586.1) in this page we have the info about CDS (start 1, end 628 in the genome sequence), so my aim is to retrieve this nucleic sequence in FASTA format!
/#!/usr/bin/perl
use strict;
use warnings;
use Bio::DB::GenBank;
use Bio::SeqIO;
# get command-line arguments, or die with a usage statement
my $usage = "RetriveDataGenBank.pl Accessions_infile Fasta_outfilen";
my $infile = shift or die $usage;
my $outfile = shift or die $usage;
my $outfileformat = 'Fasta';
my $gb = Bio::DB::GenBank->;new(-retrievaltype => 'tempfile' , -format => 'Fasta');
my @Acc_Numbers;
open (IN,$infile)||die;
my $line;
while ($line =[?]) {
chomp $line;
push (@Acc_Numbers, "'".$line."'")
}
my $seq_in = $gb->get_Stream_by_acc(@Acc_Numbers);
my $seq_out = Bio::SeqIO->new('-file' => ">$outfile",'-format' => $outfileformat);
# write each entry in the input file to the output file
while (my $inseq = $seq_in->next_seq) {
$seq_out->write_seq($inseq);
}
exit
your INPUT - list of GI's
your OUTPUT - Fasta sequences.fasta
Ilia
Use eutils as follow: 1. first search each protein against the "gene" database to get genes ID's of each protein. - use esearch 2. the search these genes in the "nucleotide" database. - esearch again. 3. use efetch to download the nucleotide FASTA sequences. see also: http://www.ncbi.nlm.nih.gov/books/NBK1058/
Thank you Dror, I'll try to apply your steps to write a corresponding script! but the problem her is, for some protein we can't find the corresponding gene, so I want just to retrieve the corresponding CDS sequence! For example I have the ID of the protein ID ZP_07126586.1 (link : ncbi.nlm.nih.gov/protein/ZP_07126586.1) in this page we have the info about CDS (start 1, end 628 in the genome sequence), so my aim is to retrieve this nucleic sequence in FASTA format!
Hello everyone,
I'd like to show you the code that I trying with:
#/usr/bin/perl
#use strict;
use warnings;
use Bio::DB::GenBank;
use Bio::SeqIO;
use Bio::SeqFeatureI;
#open my $OUTFILE, '>', "Gb_parser.fasta";
my $gb = new Bio::DB::GenBank;
# Récupération dans $seq de l'objet Genbank contenant de nombreuses informations
# Gi, Acc, séquence, Annotations, Organisme, Espèce, Genre ...
my @Acc = qw(ABK27677); #Accession number de la séquence à traiter
my $seq1 = $gb->get_Seq_by_acc(@Acc);
my $Sequence = $seq1->seq();
my $Description = $seq1->desc();
print "Acc = @Acc\nDescription = $Description\n"; #ecrit le num d'accession et sa description associée
my $seqio_object = $seq1;
for my $feat_object ($seqio_object->get_SeqFeatures) {
#print "\n", $feat_object->primary_tag, "\n";
for my $tag ($feat_object->get_all_tags){
print $tag." ";
for my $value ($feat_object->get_tag_values($tag)){
print $value."\n"
}
}
}
foreach $feat ( $seqio_object->get_SeqFeatures() ) {
if ($feat->primary_tag() eq "CDS"){ # ne s'intéresse qu'aux tags "CDS"
foreach $tag ( $feat->get_all_tags() ) {
if ($tag eq "locus_tag"){
print "\n>the locus: ", join(' ',$feat->get_tag_values($tag)); #ecrit le locus
}
elsif ($tag eq "product"){
print " ", join(' ',$feat->get_tag_values($tag)); #ecrit le nom du CDS
}
elsif ($tag eq "coded_by"){
print "\n\nThe corresponding gene acc: ", $feat->get_tag_values($tag);
#print ">\n\n", join (' ', $feat-> get_tag_values($tag)); #affiche l'acc. du gene
}
}
#print "\n"
print " \n\n", $feat->start, "..", $feat->end,"\n\n";;# " on strand ", $feat->strand, "\n";
$out = $feat->seq; #out est un Primary::Seq
$string = $out->seq(); #recupère la séquence
print $string."\n\n";
}
}
There is 2 problem with my script:
coded_by
infos that I don't know how I can use it to get the nucleic sequenceUse of this site constitutes acceptance of our User Agreement and Privacy Policy.
Do you want the gene or the transcript?
I want to get the corresponding CDS. For example I have the ID of the protein ID ZP_07126586.1 (link : http://www.ncbi.nlm.nih.gov/protein/ZP_07126586.1), in this page we have the info about CDS (start 1, end 628 in the genome sequence), so my aim is to retrieve this nucleic sequence in FASTA format!
Your case is much complicated than you've stated. This is a WGS draft. You don't have separated transcripts/genes. So, you have to extract your genes/CDSs from the WGS sequence. You really should restate your question and specify your target. Do your list of accn is from the same draft?
Exact, I have to retrieve the genes/CDS sequences from WGS in FASTA format. My input data is a list of NCBI Reference Sequences of the corresponding proteins (ZP_05752176.1, ZP_05752888.1, ZP_06245300.1, ...)
Hi Samad, Have you retrieved the The Fasta Nucleic Sequences Of A List Of Ncbi Accession Number Of Proteins successfully finally? If yes, could you please share your successful experience to us? Now, I also want to get the nucleotide sequence for each protein gene accession number? I used the search, elink, efetch ---these three E - utilities functions, but I always get "protein 110 1209747831 nuccore protein_nuccore 109 nuccore protein_nuccore_cds 109 nuccore protein_nuccore_mrna 109" at the last step. Did you have the similar experience?
This is an old thread, so not sure Samad is stil active, but did you check this? Retrieve Mrna Seq From Ncbi Given A List Of Protein Accessions I think there are several more questions related to protein ID to RNA/DNA sequence here already. Which solution have you tried?
Hi Michael,
At first, thank you for your reply!
Below is the way I have tried until now, which doesn't work well.
Protein seq ---> Nucleotide seq:
Finally, I always get,
So, now I want to try new ways....
At first, input this website. https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term=A3DH67 ---to get protein id : 110 145558928
Then input this site. https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=protein&db=nuccore&id=110 145558928 ---to get protein coding mrna id : nuccore protein_nuccore_mrna 109
Finally, input this. https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=109