Question

Fetching NCBI gene symbols from NCBI protein ids or GI identifiers

7.1 years ago

lakhujanivijay 5.9k

I have the following information from blastx annotations of bacterial genes predicted by Prodigal:

Sequence name   Sequence desc.  Sequence length Hit desc.   Hit ACC
gene_1_contig_1 excinuclease ABC subunit A  228 gi|1055624747|ref|WP_067265422.1|excinuclease ABC subunit A [Sulfitobacter sp. HI0054] gi|1024544140|gb|KZY51396.1| excinuclease ABC subunit A [Sulfitobacter sp. HI0054]   WP_067265422, KZY51396
gene_2_contig_1 excinuclease ABC subunit A  210 gi|1055651942|ref|WP_067291557.1|excinuclease ABC subunit A [Sulfitobacter sp. EhC04] gi|1032103716|gb|OAN76192.1| excinuclease ABC subunit A [Sulfitobacter sp. EhC04] WP_067291557, OAN76192
gene_3_contig_1 MFS transporter 432 gi|1055624744|ref|WP_067265419.1|MFS transporter [Sulfitobacter sp. HI0054] gi|1024544139|gb|KZY51395.1| hypothetical protein A3734_05250 [Sulfitobacter sp. HI0054]    WP_067265419, KZY51395
gene_4_contig_1 MFS transporter 561 gi|1055624744|ref|WP_067265419.1|MFS transporter [Sulfitobacter sp. HI0054] gi|1024544139|gb|KZY51395.1| hypothetical protein A3734_05250 [Sulfitobacter sp. HI0054]    WP_067265419, KZY51395

I wish to fetch gene symbols using the information from the blastx results (either the GI identifiers or the protein accessions), maybe using Entrez efetch.

So, the result would be as below:

Gene Name                         Gene symbol
excinuclease ABC subunit A        UvrA

See the link here. However, I am not sure how to proceed in this case. Can anybody please suggest something?

efetch NCBI entrez gene • 3.1k views

Updated 7.1 years ago by Puli Chandramouli Reddy ▴ 190 • written 7.1 years ago by lakhujanivijay 5.9k

Hi Vijay, did you try using BioMart? It has some useful functions to fetch gene symbols.

7.1 years ago by Sreeraj Thamban ▴ 300

The gene symbol appears to have been included in the description: https://www.ncbi.nlm.nih.gov/protein/1055624747/

7.1 years ago by Sej Modha 5.3k

Unfortunately, that is not true for all the entries I have. That would have saved a lot of time.

7.1 years ago by lakhujanivijay 5.9k

How about db2db, where you could convert RefSeq Protein Accession to Gene ID? https://biodbnet-abcc.ncifcrf.gov/db/db2db.php

7.1 years ago by Sej Modha 5.3k

You could do something like:

esearch -db protein -query "1055624747" | efetch -format docsum | xtract -pattern Title -element Title

The problem is that you are dealing with WP_* entries, which are non-redundant protein records shared across multiple strains, so the gene symbol is not separately annotated.
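
To run a lookup like this over every hit, the GI numbers first have to be pulled out of the Hit desc. column. A minimal sketch, assuming the table from the question is saved as a tab-delimited file; the file name `blastx_hits.tsv` and the function name `extract_gis` are hypothetical:

```shell
#!/usr/bin/env bash
# Print every unique GI number found in "gi|<digits>|" tokens on stdin.
extract_gis() {
    grep -oE 'gi\|[0-9]+\|' | cut -d'|' -f2 | sort -u
}

# Example with one hit description copied from the question.
hit='gi|1055624747|ref|WP_067265422.1|excinuclease ABC subunit A [Sulfitobacter sp. HI0054] gi|1024544140|gb|KZY51396.1| excinuclease ABC subunit A [Sulfitobacter sp. HI0054]'
printf '%s\n' "$hit" | extract_gis

# With EDirect installed, each GI could then go through the same pipeline.
# This needs network access, so it is left commented out:
# for gi in $(extract_gis < blastx_hits.tsv); do
#     esearch -db protein -query "$gi" | efetch -format docsum | xtract -pattern Title -element Title
# done
```

This only automates the title lookup; it does not change the fact that WP_* records may carry no separate gene symbol.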

ADD REPLY • link 7.1 years ago by GenoMax 147k

score 3 · Accepted Answer · 2017-11-03

Hi,

You can use the GI IDs to retrieve the associated information with UniProt's "Retrieve/ID mapping" tool. Select "GI number" as the source and "UniProtKB" as the target. The output is a table with all the information you need, and you can select the columns of interest.
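
Once the mapping table has been downloaded as TSV, the gene names can be picked out by column header rather than by position. A rough sketch, assuming the export has columns headed `From` and `Gene Names`; these header names and the function name `gene_names_from_mapping` are assumptions, so adjust them to the columns you actually selected. The rows in the example are placeholders, not real UniProt output:

```shell
#!/usr/bin/env bash
# Print "<From>\t<Gene Names>" for each data row of an ID-mapping TSV,
# locating both columns by their header names.
gene_names_from_mapping() {
    awk -F'\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("From" in col) || !("Gene Names" in col)) {
                print "expected From and Gene Names columns" > "/dev/stderr"
                exit 1
            }
            next
        }
        { print $(col["From"]) "\t" $(col["Gene Names"]) }
    '
}

# Placeholder rows: the second one has an empty gene-name field.
printf 'From\tEntry\tGene Names\n1055624747\tENTRY_1\tuvrA\n1055624744\tENTRY_2\t\n' |
    gene_names_from_mapping
```

Rows with an empty gene-name field are kept, which makes it easy to see which hits still lack a symbol.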

Another way is to use Batch Entrez to download the GenBank records and then parse the information you need from them.
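
The parsing step could look something like the sketch below, which assumes the records were saved in GenBank/GenPept flat-file format and that the symbol, where annotated, appears as a `/gene="..."` qualifier. The function name `genes_from_flatfile` and the sample record are hypothetical; records that carry only a locus tag produce no output:

```shell
#!/usr/bin/env bash
# Read GenBank/GenPept flat-file text on stdin and print
# "<accession.version>\t<gene symbol>" once per record and symbol.
genes_from_flatfile() {
    awk '
        $1 == "VERSION" { acc = $2 }   # remember the current record
        match($0, /\/gene="[^"]+"/) {
            # Skip the 7 characters of /gene=" and drop the closing quote.
            gene = substr($0, RSTART + 7, RLENGTH - 8)
            if (!seen[acc SUBSEP gene]++) print acc "\t" gene
        }
    '
}

# Abbreviated placeholder record, not a real database entry.
cat <<'EOF' | genes_from_flatfile
LOCUS       EXAMPLE_0001             228 aa            linear   BCT
VERSION     EXAMPLE_0001.1
FEATURES             Location/Qualifiers
     CDS             1..228
                     /gene="uvrA"
//
EOF
```

In practice the input would be the downloaded file, for example `genes_from_flatfile < sequences.gp`, where the file name is again only an example.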