Question

How to convert an NCBI protein ID to the corresponding nucleotide sequence

9.8 years ago

natasha.sernova ★ 4.0k

Dear ALL,

I have a set of NCBI protein IDs. I know how to convert them to protein sequences using the E-utilities tools.

#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;

# Fetch one protein record as FASTA via EFetch.
# NCBI serves E-utilities over HTTPS, so LWP::Protocol::https must be installed.
my $ua = LWP::UserAgent->new;
my $prot_id = "WP_005451061.1";
my $response = $ua->get('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&retmode=text&rettype=fasta&id=' . $prot_id);

unless ($response->is_success) {
    die $response->status_line;
}

my $content = $response->decoded_content();

# Choose an output layer that matches how the content was decoded.
if (utf8::is_utf8($content)) {
    binmode STDOUT, ':utf8';
} else {
    binmode STDOUT, ':raw';
}

print $content;

I use only one protein as an example, but I know that E-utilities support batch requests, so that is not critical.

But I would like to find a way to convert any NCBI protein ID to its original nucleotide source, mRNA or whatever. I work with bacteria, so introns etc. are not a problem. I saw a tool in E-utilities that could probably do it, but I could not get as far as the nucleotide sequence: I realized that the ID changes between the protein and the nucleotide record. BioMart has not helped me so far.

Probably I have to use something like this in E-utilities (the IDs are only examples):

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=protein&db=gene&id=15718680,157427902&cmd=neighbor_history

It just returns an XML file, but how do I get from that to the nucleotide sequences?

Could you please help me? Many thanks!

Natalia

gene sequence databases protein • 6.9k views

ADD COMMENT • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by natasha.sernova ★ 4.0k

Ram · Accepted Answer · 2015-02-10

9.8 years ago

David W 4.9k

You can use the ELink E-utility to find linked records (there will be an "IdList" in the resulting XML), but note:

(a) You have to use the GI number (not the accession, as you have above) for ELink.

(b) There may be multiple nucleotide records linked to a protein, and they may be much larger than the particular protein sequence (both true in this case).

Here's how the workflow might look in the R package rentrez; you can no doubt adapt the following to Perl or Your Favourite Scripting Language:

library(rentrez)

(search <- entrez_search(db="protein", term="WP_005451061[Accn]"))
# Entrez search result with 1 hits (object contains 1 IDs and no cookie)

(links <- entrez_link(dbfrom="protein", db="nuccore", id=search$ids))
# elink result with ids from 3 databases:
# [1] protein_nuccore     protein_nuccore_wgs protein_nuccore_wp
length(links$protein_nuccore)
# [1] 5

rec <- entrez_fetch(db="nuccore", rettype="fasta", id=links$protein_nuccore[1])
nchar(rec)
# [1] 77201
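For anyone adapting the ELink step outside R, here is a minimal Python sketch (standard library only) of building the request URL and pulling the linked IDs out of the returned XML, grouped by link name. The element names (`LinkSetDb`, `LinkName`, `Link`, `Id`) are what ELink returns with its default `cmd=neighbor`, as far as I know; a `cmd=neighbor_history` request stores the results on the history server instead of listing IDs. The helper names and the IDs in `SAMPLE_XML` are made up for illustration, and the sketch does no network I/O.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def elink_url(uid, dbfrom="protein", db="nuccore"):
    """Build an ELink URL for one numeric ID (hypothetical helper)."""
    return EUTILS + "/elink.fcgi?" + urlencode({"dbfrom": dbfrom, "db": db, "id": uid})


def parse_elink(xml_text):
    """Return {link_name: [linked IDs]} from an ELink XML response."""
    links = {}
    root = ET.fromstring(xml_text)
    for linkset_db in root.iter("LinkSetDb"):
        name = linkset_db.findtext("LinkName")
        links[name] = [link.findtext("Id") for link in linkset_db.findall("Link")]
    return links


# Hand-written sample in the shape of an ELink response; the IDs are invented.
SAMPLE_XML = """
<eLinkResult>
  <LinkSet>
    <DbFrom>protein</DbFrom>
    <IdList><Id>111</Id></IdList>
    <LinkSetDb>
      <DbTo>nuccore</DbTo>
      <LinkName>protein_nuccore</LinkName>
      <Link><Id>222</Id></Link>
      <Link><Id>333</Id></Link>
    </LinkSetDb>
    <LinkSetDb>
      <DbTo>nuccore</DbTo>
      <LinkName>protein_nuccore_wgs</LinkName>
      <Link><Id>444</Id></Link>
    </LinkSetDb>
  </LinkSet>
</eLinkResult>
"""

linked = parse_elink(SAMPLE_XML)
print(linked["protein_nuccore"])  # ['222', '333']
```

The URL from `elink_url` can be fetched with `urllib.request.urlopen`, and the response body passed straight to `parse_elink`.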

ADD COMMENT • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by David W 4.9k

Thank you very much, David! I will try.

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by natasha.sernova ★ 4.0k

Dear David,

Reading http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch

I've found only the following:

Sequences

Fetch FASTA for a transcript and its protein product (GIs 312836839 and 34577063)

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=sequences&id=312836839,34577063&rettype=fasta&retmode=text

The link really does lead to the sequences at NCBI:

http://www.ncbi.nlm.nih.gov/nuccore/312836839

http://www.ncbi.nlm.nih.gov/protein/34577063

But is there any way to isolate the corresponding "pure" nucleotide sequence from its start to its end, that is, only the CDS for this protein? I think not. Is that correct?

Also, nucleotide IDs are quite different from protein IDs.

Is it possible to find the nucleotide ID when I have only the protein ID or GI number?

Please give me a hint. Sorry, I have not tried your R script yet; maybe it will give me a solution.

Many thanks!

Sincerely yours,
Natalia

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by natasha.sernova ★ 4.0k

Hi Natalia,

"But is there any way to isolate the corresponding "pure" nucleotide sequence from its start to its end, that is, only the CDS for this protein? I think not. Is that correct?"

As far as I know, this is correct. But check out the "features" of the nucleotide record (in GenBank format), which might give the indices of your gene of interest.

"Is it possible to find the nucleotide ID when I have only the protein ID or GI number?"

You can get linked nucleotide IDs from protein IDs (but not accessions) with ELink. You can get protein IDs from protein accessions with ESearch (using the query I have in the code above).
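As a rough illustration of using those indices, the Python sketch below cuts a CDS out of a nucleotide sequence given a GenBank-style location string. GenBank coordinates are 1-based and inclusive, and `complement(...)` means the feature lies on the reverse strand. This is only a sketch: the function names are my own, and it handles just the two simple forms `start..end` and `complement(start..end)`, not `join(...)` or partial (`<`, `>`) locations.

```python
import re

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a plain A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_cds(sequence, location):
    """Slice `sequence` by a simple GenBank location (1-based, inclusive)."""
    location = location.strip()
    minus = re.fullmatch(r"complement\((\d+)\.\.(\d+)\)", location)
    plus = re.fullmatch(r"(\d+)\.\.(\d+)", location)
    match = minus or plus
    if match is None:
        raise ValueError("unsupported location: " + location)
    start, end = int(match.group(1)), int(match.group(2))
    # Convert the 1-based inclusive range to a Python slice.
    sub = sequence[start - 1:end]
    return reverse_complement(sub) if minus else sub


# Toy sequences, not real records.
print(extract_cds("AAATGCCCTAAGG", "3..11"))              # ATGCCCTAA
print(extract_cds("GGTTAGGGCATCC", "complement(3..11)"))  # ATGCCCTAA
```

The location string itself would come from the CDS feature whose `protein_id` qualifier matches your protein in the GenBank-format record.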
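The accession-to-ID step can be sketched in Python as well (standard library only, no network I/O). It builds the same `[Accn]` query used in the R code and reads the `IdList` from an ESearch XML response. The helper names are hypothetical and the ID in `SAMPLE_XML` is invented.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(accession, db="protein"):
    """Build an ESearch URL that looks up one accession (hypothetical helper)."""
    return EUTILS + "/esearch.fcgi?" + urlencode({"db": db, "term": accession + "[Accn]"})


def parse_esearch_ids(xml_text):
    """Return the list of IDs from an ESearch XML response."""
    root = ET.fromstring(xml_text)
    return [id_el.text for id_el in root.findall("./IdList/Id")]


# Hand-written sample in the shape of an ESearch response; the ID is invented.
SAMPLE_XML = """
<eSearchResult>
  <Count>1</Count>
  <RetMax>1</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>123456789</Id>
  </IdList>
</eSearchResult>
"""

print(esearch_url("WP_005451061"))
print(parse_esearch_ids(SAMPLE_XML))  # ['123456789']
```

The IDs returned here are what ELink expects in its `id` parameter.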

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by David W 4.9k

Thank you, David, I hope it will help.

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by natasha.sernova ★ 4.0k

Hi David,

I tried your code, but at the last step it always gives me the same output, even when I input different protein accession numbers:

protein 110 1209747831 nuccore protein_nuccore 109 nuccore protein_nuccore_cds 109 nuccore protein_nuccore_mrna 109

Does that mean my input is wrong?

Thank you in advance!

Bing

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 6.2 years ago by bison100 • 0

I also want to get the corresponding nucleotide sequence for each protein sequence from NCBI, because UniProt does not provide this service now.

ADD REPLY • link 6.2 years ago by bison100 • 0