Dear ALL,
I have a set of NCBI protein IDs. I know how to convert them to protein sequences using the E-utilities tools:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
my $prot_id = "WP_005451061.1";

# Fetch the protein record as FASTA text from efetch.
my $response = $ua->get('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&retmode=text&rettype=fasta&id=' . $prot_id);
unless ($response->is_success) {
    die $response->status_line;
}

# Pick an output layer that matches how the content was decoded.
my $content = $response->decoded_content();
if (utf8::is_utf8($content)) {
    binmode STDOUT, ':utf8';
} else {
    binmode STDOUT, ':raw';
}
print $content;
I use a single protein here as an example; I know E-utilities also support batch requests, so that part is not a problem.
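For readers who want the batch variant mentioned here, a minimal sketch follows. It assumes that efetch accepts a comma-separated list in the id parameter and returns the records as one multi-record FASTA text; the helper names efetch_fasta_url and split_fasta are made up for this illustration.

```perl
use strict;
use warnings;

# efetch endpoint; the same one used in the script above, over https.
my $EFETCH = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi';

# Build a single efetch URL for a batch of protein IDs (comma-separated).
# Hypothetical helper, not part of any library.
sub efetch_fasta_url {
    my @ids = @_;
    return "$EFETCH?db=protein&retmode=text&rettype=fasta&id=" . join(',', @ids);
}

# Split a multi-record FASTA string into a hash reference:
# first word of each header line => concatenated sequence.
sub split_fasta {
    my ($text) = @_;
    my (%seq, $id);
    for my $line (split /\r?\n/, $text) {
        if ($line =~ /^>(\S+)/) {
            $id = $1;
            $seq{$id} = '';
        } elsif (defined $id) {
            $line =~ s/\s+//g;      # drop stray whitespace inside sequence lines
            $seq{$id} .= $line;
        }
    }
    return \%seq;
}
```

The URL from efetch_fasta_url(@ids) can be passed to $ua->get exactly as in the script above, and the decoded content handed to split_fasta. For very long ID lists a POST request is probably the safer choice, but that is not shown here.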
But I would like to find a way to convert any NCBI protein ID to its original nucleotide source, mRNA or otherwise. I work with bacteria, so introns and the like are not a problem. I saw what looks like a suitable tool in E-utilities, but I could not get as far as the nucleotide sequence: I realized that the protein ID changes along the way. BioMart has not helped me so far.
I probably have to use something like this in E-utilities (IDs are optional):
It just gives me an XML file, but how can I get from that to the nucleotides?
Could you please help me? Many thanks!
Natalia
Thank you very much, David! I will try.
Dear David,
Reading http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch
I've found only the following:
Sequences
Fetch FASTA for a transcript and its protein product (GIs 312836839 and 34577063)
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=sequences&id=312836839,34577063&rettype=fasta&retmode=text
The link does lead to the sequences at NCBI:
http://www.ncbi.nlm.nih.gov/nuccore/312836839
http://www.ncbi.nlm.nih.gov/protein/34577063
But is there any way to isolate the corresponding "pure" nucleotide sequence from its start to its end, i.e. only the CDS for this protein? I think not. Is that correct?
Also, nucleotide IDs are quite different from protein IDs.
Is it possible to find the nucleotide ID given only the protein ID or GI number?
Please give me a hint. Sorry, I have not tried your R script yet; maybe it will give me a solution.
Many thanks!
Sincerely yours,
Natalia
Hi Natalia,
As far as I know this is correct. But check out the "features" of the nucleotide record (in GenBank format), which might give the indices of your gene of interest.
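To illustrate the idea of using feature indices, here is a rough Perl sketch. It assumes you have already pulled the CDS location string (for example "266..1234" or "complement(266..1234)") out of the GenBank feature table and hold the nucleotide sequence as a plain string. Both helper names are hypothetical, and joined (multi-segment) locations are deliberately not handled.

```perl
use strict;
use warnings;

# Parse a simple GenBank location such as "266..1234" or
# "complement(266..1234)". Returns (start, end, strand) with
# 1-based inclusive coordinates. join(...) locations are not supported.
sub parse_location {
    my ($loc) = @_;
    die "Unsupported location: $loc\n" if $loc =~ /join|order/;
    my $strand = ($loc =~ /^complement\(/) ? -1 : 1;
    my ($start, $end) = $loc =~ /<?(\d+)\.\.>?(\d+)/
        or die "Unsupported location: $loc\n";
    return ($start, $end, $strand);
}

# Cut the CDS out of a nucleotide sequence; reverse-complement it
# when the feature lies on the complementary strand.
sub extract_cds {
    my ($nucleotide, $loc) = @_;
    my ($start, $end, $strand) = parse_location($loc);
    my $cds = substr($nucleotide, $start - 1, $end - $start + 1);
    if ($strand < 0) {
        $cds = reverse $cds;                 # scalar context: reverses the string
        $cds =~ tr/ACGTacgt/TGCAtgca/;       # complement each base
    }
    return $cds;
}
```

For example, extract_cds("AAATGCCCTAAGG", "3..11") returns "ATGCCCTAA", and the same call with "complement(3..11)" returns "TTAGGGCAT".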
You can get linked nucleotide IDs from protein IDs (but not accessions) with elink. You can get protein IDs from protein accessions with esearch (using the query I have in the code above).
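The two lookups described here could be sketched in Perl roughly as follows. This is not the R code referred to above: the esearch term with the [accn] field tag is my own assumption about a workable query, the helper names are invented for the example, and the XML is picked apart with regular expressions only to keep the sketch short (a real XML parser would be more robust).

```perl
use strict;
use warnings;

my $EUTILS = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils';

# esearch URL: protein accession -> protein ID.
# "%5Baccn%5D" is the URL-encoded "[accn]" field tag (an assumption here).
sub esearch_url {
    my ($accession) = @_;
    return "$EUTILS/esearch.fcgi?db=protein&term=$accession%5Baccn%5D";
}

# elink URL: protein ID -> linked nuccore records.
sub elink_url {
    my ($protein_id) = @_;
    return "$EUTILS/elink.fcgi?dbfrom=protein&db=nuccore&id=$protein_id";
}

# Pull all <Id> values out of an esearch XML result.
sub esearch_ids {
    my ($xml) = @_;
    my @ids = $xml =~ m{<Id>(\d+)</Id>}g;
    return @ids;
}

# Group linked IDs from an elink XML result by link name,
# e.g. { protein_nuccore => [ ... ], protein_nuccore_cds => [ ... ] }.
sub parse_elink {
    my ($xml) = @_;
    my %links;
    while ($xml =~ m{<LinkSetDb>(.*?)</LinkSetDb>}gs) {
        my $block = $1;
        my ($name) = $block =~ m{<LinkName>([^<]+)</LinkName>} or next;
        push @{ $links{$name} }, $block =~ m{<Link>\s*<Id>(\d+)</Id>}g;
    }
    return \%links;
}
```

The intended flow is: fetch esearch_url($accession), take the first ID from esearch_ids, fetch elink_url($id), and then pass the nuccore IDs from parse_elink to efetch with db=nuccore. If every accession yields the same linked IDs, it is worth printing the intermediate protein ID to check that the esearch step really returned a different record each time.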
Thank you, David, I hope it will help.
Hi David,
I tried your code, but at the last step it always gives me
protein 110 1209747831 nuccore protein_nuccore 109 nuccore protein_nuccore_cds 109 nuccore protein_nuccore_mrna 109
even when I input different protein accession numbers. Does this mean that I entered something wrong? Thank you in advance!
Bing
I also want to get the corresponding nucleotide sequence for each protein sequence from NCBI, because UniProt doesn't provide this service now.