Question

NCBI GI to Gene Description

1

10.3 years ago

navillusol858 ▴ 10

Hello all,

I have a very large list of NCBI gene IDs (GI numbers such as gi:47221249, etc.). I am hoping to use this list to get the description for each ID. For the GI above, it would be "unnamed protein product [Tetraodon nigroviridis]".

I would like to end up with a file that has two columns: one with the IDs and the other with the description for each ID.

Would anyone know of a script/software already available to do a job such as this?

Thanks for the help!

blast gene next-gen • 13k views

updated 3.0 years ago by Ram 44k • written 10.3 years ago by navillusol858 ▴ 10

Ram · Accepted Answer · 2014-08-20

4

10.3 years ago

5heikki 11k

With Entrez Direct you can:

efetch -id 47221249 -db protein -format docsum | xtract -element Title
unnamed protein product [Tetraodon nigroviridis]

Or, if you have BLAST+ installed, you could download the latest nr database and query it with blastdbcmd.

updated 5.3 years ago by Ram 44k • written 10.3 years ago by 5heikki 11k

0

Hi 5heikki, thanks for the help, can I ask if it is possible to give efetch a file that contains a list of IDs, and get it to return the Titles in a list also (a file)?

written 10.3 years ago by navillusol858 ▴ 10

0

As far as I know, you can't pass it a list as such, but it's trivial to script it. For example in Bash shell:

while read -r line; do title=$(efetch -id "$line" -db protein -format docsum < /dev/null | xtract -element Title); printf '%s\t%s\n' "$line" "$title"; done < listOfGis.txt
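For a very large list, one request per GI can be slow. As far as I know, the -id argument of efetch also takes several comma-separated identifiers, so a variation is to send the GIs in batches. The following is only a sketch under that assumption: batch_ids and gis_to_titles are hypothetical helper names, the batch size of 100 is arbitrary, and it assumes the docsum XML wraps each record in a DocumentSummary element with Id and Title children. It is written in plain POSIX shell, so it also runs in Bash.

```shell
#!/bin/sh
# batch_ids (hypothetical helper): read one ID per line on stdin and print
# comma-separated batches of at most N IDs (N = first argument, default 100).
# The body runs in a subshell, so its variables do not leak into the caller.
batch_ids() (
    size=${1:-100}; batch=""; count=0
    # "|| [ -n "$id" ]" keeps a final line that lacks a trailing newline
    while read -r id || [ -n "$id" ]; do
        [ -z "$id" ] && continue              # skip blank lines
        batch="${batch:+$batch,}$id"          # append with a comma separator
        count=$((count + 1))
        if [ "$count" -eq "$size" ]; then
            printf '%s\n' "$batch"
            batch=""; count=0
        fi
    done
    # flush the last, possibly shorter, batch
    if [ -n "$batch" ]; then printf '%s\n' "$batch"; fi
)

# gis_to_titles (hypothetical helper): one efetch call per batch, printing
# "Id<TAB>Title" for each record returned.
# Assumes efetch and xtract (Entrez Direct) are on PATH.
# $1 = file with one GI per line, $2 = optional batch size.
gis_to_titles() {
    batch_ids "${2:-100}" < "$1" | while read -r ids; do
        # < /dev/null stops efetch from consuming the loop's stdin
        efetch -db protein -id "$ids" -format docsum < /dev/null |
            xtract -pattern DocumentSummary -element Id Title
    done
}
```

Usage would be something like `gis_to_titles listOfGis.txt > gi_titles.tsv`. Printing the Id next to the Title means the output can still be matched back to the input if some GIs return nothing or the records come back in a different order.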

updated 5.3 years ago by Ram 44k • written 10.3 years ago by 5heikki 11k

0

Thanks 5heikki, that is exactly what I was looking for, very much appreciated!

Regards

written 10.3 years ago by navillusol858 ▴ 10

2

10.3 years ago

Alastair Kerr 5.3k

When possible, I suggest avoiding GI identifiers, as I have seen many cases where they could not be used to retrieve historical data.

You could use Batch Entrez and extract the information you need from the resulting file, remembering to select the appropriate database (nucleotide, protein, etc.).

updated 3.0 years ago by Ram 44k • written 10.3 years ago by Alastair Kerr 5.3k

0

Thanks for the help Alastair! I also have GenBank IDs, do you think it would be more accurate to use them?

Regards

written 10.3 years ago by navillusol858 ▴ 10

1

If using NCBI, I would ideally use RefSeq IDs, with the version number if you have it. See http://www.ncbi.nlm.nih.gov/books/NBK50679/

updated 3.0 years ago by Ram 44k • written 10.3 years ago by Alastair Kerr 5.3k

1

Where possible, it is usually best to use the accession (e.g. K00650) rather than the GenBank/GenPept Locus/Id (e.g. HUMFOS) or the NCBI GI number (e.g. 182734). The Locus/Id is not guaranteed to be stable and can change between releases. The GI number refers to a specific version of the sequence, so a revised sequence gets a new GI, and as a bare number it suffers from anonymous identifier syndrome (e.g. is a particular GI a protein or a nucleotide sequence?). The accession, or the accession-based sequence version (e.g. K00650.1) if a reference to a specific sequence is required, is guaranteed to be stable and persistent.

Since the accession is shared across the INSDC member databases (i.e. DDBJ, ENA and GenBank), using it has the advantage that nucleotide sequences (whole entries or CDS features) can be retrieved from any of the INSDC databases. For protein sequences, the accession used in GenBank/GenPept is the INSDC protein_id, which is also shared across INSDC. It is used in databases that consume data from the INSDC databases: for example UniParc, which has mappings to other sources that share the same protein sequence, and UniProtKB, through the import of CDS translations into UniProtKB/TrEMBL. It also serves as the CDS identifier for CDS entries in ENA Coding.

For RefSeq entries the same principle applies, but they use the accession as the Locus/Id in the GenBank format. This is also the case in UniProtKB, where the entry name (ID) is a human-friendly mnemonic that is subject to change, but the primary accession is the stable identifier.
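To check which kind of identifier a mixed list actually contains, a rough shape test can help. The sketch below is only illustrative: classify_id is a hypothetical helper, and its patterns are simplified assumptions (one to six capital letters, an optional underscore, then at least five digits), not the official INSDC or RefSeq accession formats. Anything that does not fit, such as a Locus name, is reported as "other".

```shell
#!/bin/sh
# classify_id (hypothetical helper): guess the identifier type from its shape.
# The patterns are simplified assumptions, not official accession formats.
classify_id() {
    if printf '%s\n' "$1" | grep -Eq '^[0-9]+$'; then
        echo "gi"                    # bare number, database unknown
    elif printf '%s\n' "$1" | grep -Eq '^[A-Z]{1,6}_?[0-9]{5,}\.[0-9]+$'; then
        echo "accession.version"     # e.g. K00650.1
    elif printf '%s\n' "$1" | grep -Eq '^[A-Z]{1,6}_?[0-9]{5,}$'; then
        echo "accession"             # e.g. K00650
    else
        echo "other"                 # e.g. a Locus name such as HUMFOS
    fi
}
```

Running `classify_id` on each line of a list before retrieval shows how many entries are bare GIs, for which the database (protein or nucleotide) still has to be known from elsewhere.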

updated 3.0 years ago by Ram 44k • written 10.3 years ago by hpmcwill ★ 1.2k