Retrieving Official Gene Symbols From Full Length Protein Names Automatically

14.2 years ago

Eric Normandeau 11k

Hi,

I have a list of a few hundred protein names and I would like to retrieve their official gene symbols automatically. For example:

Dihydroorotate dehydrogenase, mitochondrial precursor
ubiquitin-protein ligase
AP-2 complex subunit mu-1-A
Proliferation-associated protein 2G4

Becomes:

Dhodh
Rnf19a
Ap2m1
MRPL4

I am currently using UniProtKB manually, but I would very much like to automate the process. Would anyone have suggestions about the following:

What database to use?
What approach/program/package to query it?
Online vs. downloading the database?
Any other means of doing this?

I don't mind having to write a parser for a database if needed, but I don't know what source to start with.

Thanks!

annotation protein gene conversion • 4.8k views

ADD COMMENT • link updated 14.2 years ago by Neilfws 49k • written 14.2 years ago by Eric Normandeau 11k

I think a better title for this question might be "Retrieving official gene/protein symbols from full length gene/protein names automatically"

ADD REPLY • link 14.2 years ago by Casey Bergman 18k

@Casey: Done :)

ADD REPLY • link 14.2 years ago by Eric Normandeau 11k

Thank you all for your comments and suggestions! Having the latest hot computer --> 2500$; One full run of 454 sequencing --> 6000$; Biostar Forum --> Priceless ;)

ADD REPLY • link 14.2 years ago by Eric Normandeau 11k

14.2 years ago

Michael Kuhn 5.0k

You can use the STRING API for this, like so:

echo "Dihydroorotate dehydrogenase, mitochondrial precursor" | \
xargs -I{} wget -nv -O - \
'http://stitch.embl.de/api/tsv-no-header/resolve?identifier={}&species=9606&echo_query=1' \
> protein_names.tsv

which gives you, along with the Ensembl ID, the gene name DHODH:

Dihydroorotate dehydrogenase, mitochondrial precursor   9606.ENSP00000219240    9606    Homo sapiens    DHODH   Dihydroorotate dehydrogenase, mitochondrial precursor (EC 1.3.3.1) (Dihydroorotate oxidase) (DHOdehase)

Plus, now you have valid STRING identifiers you can use to query the network. :-)
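
For a few hundred names, the output file can be summarised per query before any manual pruning. Here is a minimal Java sketch of that step. It assumes the column order seen in the example line above (query, STRING identifier, taxon ID, species, gene name, annotation); the class and method names are hypothetical and not part of the STRING API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: groups the gene names found in the resolve output by query name.
class ResolveOutputParser {

    // Assumed column positions, read off the example output line above.
    static final int QUERY_COLUMN = 0;
    static final int SYMBOL_COLUMN = 4;

    // Returns query name -> gene names of all hits, in the order the lines appear.
    static Map<String, List<String>> symbolsByQuery(List<String> tsvLines) {
        Map<String, List<String>> hits = new LinkedHashMap<>();
        for (String line : tsvLines) {
            String[] fields = line.split("\t", -1);
            if (fields.length <= SYMBOL_COLUMN) {
                continue; // skip blank or truncated lines
            }
            hits.computeIfAbsent(fields[QUERY_COLUMN], k -> new ArrayList<>())
                .add(fields[SYMBOL_COLUMN]);
        }
        return hits;
    }

    public static void main(String[] args) {
        // The example hit from above, with real tab separators.
        List<String> lines = List.of(
            "Dihydroorotate dehydrogenase, mitochondrial precursor\t9606.ENSP00000219240"
            + "\t9606\tHomo sapiens\tDHODH\tDihydroorotate dehydrogenase, mitochondrial precursor");
        // Prints one line per query: query, number of hits, comma-separated gene names.
        symbolsByQuery(lines).forEach((query, symbols) ->
            System.out.println(query + "\t" + symbols.size() + "\t" + String.join(",", symbols)));
    }
}
```

Queries with more than one gene name in the summary are the ones that need a manual look.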

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 14.2 years ago by Michael Kuhn 5.0k

Thanks for this method, Michael. I'll finish my boring manual annotation and then test it against the results obtained with the STRING API. I'll try to automate the extraction of the results in the numerous cases where there are many hits, but that may be tricky. This may end up being 'computer-assisted manual annotation' :)

ADD REPLY • link 14.2 years ago by Eric Normandeau 11k

This is actually what I did a while ago when mapping a set of protein names (extracted from a collaborator's Excel table...): pipe all the names into the API, and then edit the protein_names.tsv file to prune mismatches.

ADD REPLY • link 14.2 years ago by Michael Kuhn 5.0k

14.2 years ago

Larry_Parnell 16k

Good question, because this is a common task. In fact, it would be nice if someone could share their table mapping protein names to gene/protein symbols, so that the task does not need to be repeated. I could definitely use this for human, mouse and rat.

You could try the HUGO / HGNC site for a list of the accepted or official names and symbols.

ADD COMMENT • link 14.2 years ago by Larry_Parnell 16k

The mapping tables that are used by STRING in the solution by Michael Kuhn are freely available from the STRING download page :-)
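
With a downloaded table the lookup can be done offline. The Java sketch below is only an illustration: it assumes a tab-separated file in which one column holds the full name and another the symbol, and it takes both column positions as parameters because the actual layout of the STRING files has to be checked first. The class name, method name and sample rows are made up.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical offline lookup built from the lines of a tab-separated mapping table.
class AliasTable {

    // Maps the text in aliasColumn to the text in symbolColumn (0-based positions).
    static Map<String, String> fromLines(List<String> lines, int aliasColumn, int symbolColumn) {
        Map<String, String> table = new HashMap<>();
        int lastNeeded = Math.max(aliasColumn, symbolColumn);
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comment/header lines
            }
            String[] fields = line.split("\t", -1);
            if (fields.length <= lastNeeded) {
                continue; // skip rows that are too short
            }
            // Keep the first symbol seen for a name; later duplicates are ignored.
            table.putIfAbsent(fields[aliasColumn], fields[symbolColumn]);
        }
        return table;
    }

    public static void main(String[] args) {
        // Illustrative rows only, not real file content: symbol<TAB>full name.
        List<String> lines = List.of(
            "# symbol\tname",
            "DHODH\tDihydroorotate dehydrogenase, mitochondrial precursor");
        Map<String, String> table = fromLines(lines, 1, 0);
        System.out.println(table.get("Dihydroorotate dehydrogenase, mitochondrial precursor"));
    }
}
```

For a real file, the lines could be read with `java.nio.file.Files.readAllLines` before calling `fromLines`.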

ADD REPLY • link 14.2 years ago by Lars Juhl Jensen 11k

Thank you, Lars. This is exactly what I meant - and not knowing of the resource creates an obstacle to my work moving forward. That's solved!

ADD REPLY • link 14.2 years ago by Larry_Parnell 16k

14.2 years ago

Pierre Lindenbaum 166k

The following Java program uses the NCBI E-utilities to query the Gene database.

If only one item is found, it prints the official gene symbol to stdout.

Else there is an ambiguity: it displays an interactive table and asks the user to select the correct row.

	import java.awt.Dimension;
	import java.net.URLEncoder;
	import java.util.ArrayList;
	import java.util.List;
	import java.util.logging.Logger;

	import javax.swing.JOptionPane;
	import javax.swing.JScrollPane;
	import javax.swing.JTable;
	import javax.swing.ListSelectionModel;
	import javax.swing.table.DefaultTableModel;
	import javax.xml.parsers.DocumentBuilder;
	import javax.xml.parsers.DocumentBuilderFactory;
	import javax.xml.xpath.XPath;
	import javax.xml.xpath.XPathConstants;
	import javax.xml.xpath.XPathFactory;

	import org.w3c.dom.Document;
	import org.w3c.dom.NodeList;

	/** Looks up official gene symbols in NCBI Gene from protein names given on the command line. */
	public class Biostar5460
	{
		private Logger LOG=Logger.getLogger("Biostar5460");

		/** One candidate gene returned by esearch. */
		private class Item
		{
			String id="";
			String Prot_ref_desc="";
			String Entrezgene_summary="";
			String locus;
			Item(String id)
			{
				this.id=id;
			}
		}

		private DocumentBuilder builder;
		private XPath xpath;

		private Biostar5460() throws Exception
		{
			DocumentBuilderFactory factory=DocumentBuilderFactory.newInstance();
			factory.setCoalescing(true);
			factory.setNamespaceAware(false);
			factory.setExpandEntityReferences(true);
			factory.setValidating(false);
			factory.setIgnoringComments(true);
			factory.setIgnoringElementContentWhitespace(true);
			builder=factory.newDocumentBuilder();

			this.xpath=XPathFactory.newInstance().newXPath();
		}

		/** Searches NCBI Gene for one name and prints "symbol TAB name" to stdout. */
		private void search(String term) throws Exception
		{
			LOG.info(term);
			// esearch: get the Gene IDs matching the name, restricted to human.
			// E-utilities are served over https and the contact parameter is "email".
			String uri="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gene&retmode=xml&tool=biostar5460" +
				"&email=me_at_nowhere_com&term="+
				URLEncoder.encode(term+" \"Homo sapiens\"[ORGN]","UTF-8");
			LOG.info(uri);
			Document dom=builder.parse(uri);
			NodeList idList=(NodeList)this.xpath.evaluate("/eSearchResult/IdList/Id", dom, XPathConstants.NODESET);
			if(idList.getLength()==0)
			{
				System.out.println("#NOT-FOUND\t"+term);
				return;
			}
			// efetch: download each record and extract symbol, protein description and summary.
			List<Item> array=new ArrayList<Item>(idList.getLength());
			for(int i=0;i< idList.getLength();++i)
			{
				LOG.info((i+1)+"/"+idList.getLength());
				Item item=new Item(idList.item(i).getTextContent());
				uri="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&retmode=xml&retmax=100&id="+item.id;
				LOG.info(uri);
				dom=builder.parse(uri);
				item.locus=(String)xpath.evaluate("/Entrezgene-Set/Entrezgene/Entrezgene_gene/Gene-ref/Gene-ref_locus", dom,XPathConstants.STRING);
				item.Prot_ref_desc=(String)xpath.evaluate("/Entrezgene-Set/Entrezgene/Entrezgene_prot/Prot-ref/Prot-ref_desc", dom,XPathConstants.STRING);
				item.Entrezgene_summary=(String)xpath.evaluate("/Entrezgene-Set/Entrezgene/Entrezgene_summary", dom,XPathConstants.STRING);
				array.add(item);
			}
			if(array.size()==1)
			{
				// Unambiguous: print the symbol directly.
				System.out.println(array.get(0).locus+"\t"+term);
			}
			else
			{
				// Ambiguous: show all candidates in a table and let the user pick one row.
				DefaultTableModel m=new DefaultTableModel(new String[]{"id","locus","desc","summary"}, array.size());
				for(int i=0;i< array.size();++i)
				{
					Item item=array.get(i);
					m.setValueAt(item.id, i, 0);
					m.setValueAt(item.locus, i, 1);
					m.setValueAt(item.Prot_ref_desc, i, 2);
					m.setValueAt(item.Entrezgene_summary, i, 3);
				}
				JTable table=new JTable(m);
				table.setSelectionMode(ListSelectionModel.SINGLE_SELECTION);
				JScrollPane scroll=new JScrollPane(table);
				scroll.setPreferredSize(new Dimension(800,500));
				if(JOptionPane.showConfirmDialog(null, scroll,
					"Select",
					JOptionPane.OK_CANCEL_OPTION,JOptionPane.QUESTION_MESSAGE,null)
					!=JOptionPane.OK_OPTION)
				{
					// Dialog cancelled: report the name as not found.
					System.out.println("#NOT-FOUND\t"+term);
					return;
				}
				if(table.getSelectedRow()==-1)
				{
					System.out.println("#NOT-SELECTED\t"+term);
					return;
				}
				System.out.println(array.get(table.getSelectedRow()).locus+"\t"+term);
			}
		}

		public static void main(String[] args)
		{
			try {
				Biostar5460 app=new Biostar5460();
				// Each command-line argument is one protein name.
				for(int i=0;i< args.length;++i)
				{
					app.search(args[i]);
				}
			}
			catch (Exception e)
			{
				e.printStackTrace();
			}
		}
	}

Compilation:

javac Biostar5460.java

Execution:

java Biostar5460 "Dihydroorotate dehydrogenase, mitochondrial precursor" "ubiquitin-protein ligase" > output.tsv

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 14.2 years ago by Pierre Lindenbaum 166k

14.2 years ago

Casey Bergman 18k

For some species, like D. melanogaster, there are look-up tables between full-length gene/protein name synonyms and their gene symbols that you could try to parse directly.

More generally, I think your problem is the same as the gene/protein name normalization (GNN) problem, which is currently a matter of active research in the text mining community. If so, then it appears there is no current solution to resolve full length gene/protein names to database identifiers and thence to official gene IDs, as in your case.

The state-of-the-art methods for the gene/protein name normalization problem are GNAT and geneTUKit, but they may still not perform as well as you would like. I also suspect that the problems experienced by GNN methods will also affect the solutions proposed by Pierre and Michael (which I think are nevertheless both valid and worth trying). For example, running Michael's STRING approach yields the following promising, but not bullet-proof, results:

num_hits  official_name_found   full_gene_name 
1         Dhodh found           Dihydroorotate dehydrogenase, mitochondrial precursor 
193       Rnf19a not found *    ubiquitin-protein ligase 
3         Ap2m1 found           AP-2 complex subunit mu-1-A
1         MRPL4 not found       Proliferation-associated protein 2G4

* several other RNF family members found

Thus you may not get a single hit or the desired gene name with this approach or any other. Unfortunately, inherent variability in gene/protein name usage may be the enemy in the search for a fully automated solution to this problem.
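
The table above suggests a simple triage that can be scripted: accept single hits, send multi-hit names to manual review, and flag names with no hit. Below is a minimal Java sketch of that idea. The class, enum and method names are hypothetical, and the comparison ignores case only because the expected symbols in the question use mouse-style capitalisation (Dhodh) while STRING returned DHODH.

```java
import java.util.List;

// Hypothetical triage of lookup results by number of hits.
class HitTriage {

    enum Status { NOT_FOUND, UNIQUE, AMBIGUOUS }

    // Zero hits: nothing found; one hit: accept; more than one: needs a manual look.
    static Status classify(int numHits) {
        if (numHits <= 0) {
            return Status.NOT_FOUND;
        }
        return numHits == 1 ? Status.UNIQUE : Status.AMBIGUOUS;
    }

    // True if the expected symbol is among the returned ones, ignoring case.
    static boolean containsSymbol(List<String> returnedSymbols, String expected) {
        for (String symbol : returnedSymbols) {
            if (symbol.equalsIgnoreCase(expected)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hit counts taken from the table above.
        String[] names = {
            "Dihydroorotate dehydrogenase, mitochondrial precursor",
            "ubiquitin-protein ligase",
            "AP-2 complex subunit mu-1-A",
            "Proliferation-associated protein 2G4"
        };
        int[] numHits = {1, 193, 3, 1};
        for (int i = 0; i < names.length; i++) {
            System.out.println(classify(numHits[i]) + "\t" + names[i]);
        }
    }
}
```

A single hit is still not guaranteed to be the right gene, as the last row of the table shows, so spot checks on the accepted names remain worthwhile.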

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 14.2 years ago by Casey Bergman 18k

14.2 years ago

Neilfws 49k

It's good to learn that there are resources like STRING which can help solve this common problem.

I would just make a general point: "names" are inherently ambiguous. Not just because there are many - even many synonyms for one object - but because of factors beyond your control: misspelling, erratic use of upper versus lower case and so on. This makes any kind of name-based search very difficult, which is why identifiers (accessions, official symbols) are preferred. It's often easier to query using IDs and retrieve names than the other way around.
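
Some of the avoidable variation (case, stray whitespace) can at least be removed before searching. The Java sketch below shows one possible normalisation step, with hypothetical class and method names. It does nothing about misspellings or true synonyms, which still need a curated table or manual review.

```java
import java.util.Locale;

// Hypothetical normaliser that reduces a protein name to a canonical lookup key.
class NameNormalizer {

    // Trims the name, collapses runs of whitespace to one space, and lower-cases it.
    static String normalize(String name) {
        return name.trim()
                   .replaceAll("\\s+", " ")
                   .toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Two spellings of the same name from the question map to the same key.
        String a = normalize("AP-2 complex subunit mu-1-A");
        String b = normalize("  ap-2  Complex Subunit MU-1-a ");
        System.out.println(a.equals(b));
    }
}
```

Applying the same function to both the query names and the keys of a lookup table keeps the two sides comparable.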

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 14.2 years ago by Neilfws 49k