Question

Gene symbol ambiguities when matching different gene symbol lists

0

8.6 years ago

dolevrahat ▴ 40

Hello

I am working on an analysis that requires integrating per-gene data from several databases and screens. Since gene symbols vary between the data sources I am using, I tried to come up with a way to match them. For genes with incompatible symbols, I tried querying the Homo sapiens gene_info table that I downloaded from NCBI for aliases.

But then I found out that some gene symbols correspond to multiple genes.

For example, the symbol C10orf2 is associated both with a gene on chromosome 10 and with the gene CHMP1B on chromosome 18. I confirmed this observation by searching bioDBnet, which was recommended in a previous post.

I have also tried using the geneSynonym package but ran into similar problems.

Does anyone have an idea why this type of ambiguity happens? More practically, if anyone has run into such a problem before, I would appreciate any suggestions on how to match the gene symbol lists in a way that is not ambiguous.

(Obviously it would be better to compare IDs such as Entrez IDs or ENSGs/ENSPs, but not all the sources that I use provide these.)

Thanks in advance

Dolev Rahat

gene_symbol • 2.2k views

updated 8.6 years ago by John 13k • written 8.6 years ago by dolevrahat ▴ 40

1

Victor McKusick, the guy responsible for OMIM and the PI of my old PI (my grand-PI?), once said "Genes are like rivers - no one can really point to exactly where they start or end, and the middle bit is always changing, but we all agree that they should be named after the people who find them... unless, of course, a more popular name comes along - usually one that describes what happens when the river disappears." ... "If you ever get a chance to name a gene, best to just name it after its sequence at the time you found it."

The guy had 0 knowledge of anything computery, but he's totally right - naming genes is dumb to begin with. If you really must, just use 1 naming schema and define the names based on reference position. Anything beyond that becomes really really messy really quickly.

8.6 years ago by John 13k

1

8.6 years ago

Denise CS ★ 5.2k

The ambiguity arises because, although C10orf2 and CHMP1B are both official HGNC gene names, they also have synonyms. CHMP1B is also known as C10orf2 according to EntrezGene, but not according to HGNC. This is what is causing the problem here. C10 stands for chr 10, so how can this symbol be used as a synonym for CHMP1B, which is on chr 18? I'd suggest you stick to the official gene names and forget about the synonyms. It would be worth telling Entrez about this example so that they can rectify it. Hopefully there are not many cases like this.
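To get a feel for how many such cases exist, one option is to scan the gene table for names that point to more than one gene. Below is a minimal Python sketch, not a definitive implementation. It assumes each row has been reduced to (GeneID, Symbol, Synonyms), with synonyms pipe-separated and "-" meaning none, as in the NCBI gene_info file. The rows, the GeneIDs and the function name `find_colliding_symbols` are made up for illustration.

```python
from collections import defaultdict

# Toy rows in the spirit of NCBI gene_info: (GeneID, Symbol, Synonyms).
# The GeneIDs are placeholders, NOT real Entrez IDs, and GENE_X/ALIAS_X
# are invented names.
ROWS = [
    ("1", "C10orf2", "-"),
    ("2", "CHMP1B", "C10orf2"),
    ("3", "GENE_X", "ALIAS_X"),
]


def find_colliding_symbols(rows):
    """Return {name: sorted gene IDs} for names attached to more than one gene."""
    ids_by_name = defaultdict(set)
    for gene_id, symbol, synonyms in rows:
        # The official symbol counts as a name for this gene.
        ids_by_name[symbol].add(gene_id)
        # "-" marks an empty synonyms field; otherwise split on "|".
        if synonyms != "-":
            for alias in synonyms.split("|"):
                ids_by_name[alias].add(gene_id)
    return {name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1}


print(find_colliding_symbols(ROWS))
```

Run on the full table, this would list every name that needs manual attention before symbols can be matched safely.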

8.6 years ago by Denise CS ★ 5.2k

0

Note that Ensembl has also inherited C10orf2 as a synonym for CHMP1B, so maybe Ensembl could do some checks for this.

8.6 years ago by Jean-Karim Heriche 27k

1

Yes, I've noticed it and have already let the developers know about this. We've inherited this synonym from EntrezGene, whereas we should have taken the synonyms from HGNC, since they are our source for the official gene names in human. We will look into it.

8.6 years ago by Denise CS ★ 5.2k

score 2 · Accepted Answer · 2016-05-09

There are several reasons why different loci can be associated with the same gene symbol. Most of them boil down to differences in genome annotation, either because different databases do different things or because of differences between database versions. A symbol could have been associated with a sequence that was later found in multiple copies in the genome, a symbol could have been assigned independently to different genes, or some data sources could be using outdated symbols.

You also have to decide what a gene is for you. For many biologists, a gene is defined by its symbol, so any transcript or protein associated with that symbol, wherever it comes from in the genome, is part of the gene. In other words, for all practical purposes, biologists often consider duplicated sequences producing the same products as one gene. Alternatively, a gene can be defined as a set of related transcripts/proteins produced by the same locus. This is the type of gene definition used by Ensembl. Under this definition, duplicated sequences give rise to different genes even if they have the same products.

To disambiguate, try to map the symbols to IDs, even IDs from different databases, then reconcile the IDs and pay attention to versioning. If you know how current your symbols are, you could also use an Ensembl release from the same period to map them to Ensembl genes, on the assumption that your symbols were official gene symbols for those genes at the time. For example, there's currently only one Ensembl gene with official symbol C10orf2; other uses of C10orf2 are as synonyms.
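As a rough illustration of the symbol-to-ID mapping, here is a Python sketch that prefers official symbols, falls back to a synonym only when it points to a single gene, and flags everything else for manual review. The rows, the gene IDs and the helper names (`build_lookup`, `resolve`) are hypothetical, and the sketch assumes that official symbols are unique within the table you load, which you should check for your own data and database version.

```python
from collections import defaultdict

# Toy rows: (gene ID, official symbol, pipe-separated synonyms or "-").
# IDs are placeholders, not real database identifiers; GENE_X, GENE_Y,
# ALIAS_X and SHARED are invented names.
ROWS = [
    ("1", "C10orf2", "-"),
    ("2", "CHMP1B", "C10orf2"),
    ("3", "GENE_X", "ALIAS_X|SHARED"),
    ("4", "GENE_Y", "SHARED"),
]


def build_lookup(rows):
    """Build separate official-symbol and synonym lookups."""
    official = {}                 # official symbol -> gene ID (assumed unique)
    synonyms = defaultdict(set)   # synonym -> set of gene IDs
    for gene_id, symbol, syn_field in rows:
        official[symbol] = gene_id
        if syn_field != "-":
            for alias in syn_field.split("|"):
                synonyms[alias].add(gene_id)
    return official, synonyms


def resolve(symbol, official, synonyms):
    """Return (gene ID or None, status) for one input symbol."""
    if symbol in official:
        return official[symbol], "official"
    hits = synonyms.get(symbol, set())
    if len(hits) == 1:
        return next(iter(hits)), "synonym"
    if len(hits) > 1:
        return None, "ambiguous"
    return None, "unknown"


official, synonyms = build_lookup(ROWS)
for name in ["C10orf2", "ALIAS_X", "SHARED", "NOT_A_GENE"]:
    print(name, resolve(name, official, synonyms))
```

Here C10orf2 resolves through its official symbol, so its use as a synonym of CHMP1B is ignored, while SHARED is reported as ambiguous rather than silently assigned to one gene. Building the lookup from a table or release of the same period as your symbol lists keeps the versioning consistent.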