Hello
I am working on an analysis which requires integrating per-gene data from several databases and screens. Since gene symbols vary between the data sources I am using, I tried to come up with a way to match them. I tried querying the gene_info table for Homo sapiens, which I downloaded from NCBI, for aliases of genes with incompatible symbols, but then I found out that some gene symbols correspond to multiple genes.
For example, the symbol C10orf2 is associated both with a gene on chromosome 10 and with the gene CHMP1B on chromosome 18. This observation was also confirmed by searching bioDBnet, which was recommended in a previous post.
I have also tried using the geneSynonym package but ran into similar problems.
Does anyone have an idea why this type of ambiguity happens? More practically, if anyone has run into this problem before, I would appreciate any suggestions on how to match the gene symbol lists in a way that is not ambiguous.
(Obviously it would be better to compare IDs such as Entrez IDs or ENSGs/ENSPs, but not all the sources that I use provide these.)
Thanks in advance
Dolev Rahat
Victor McKusick, the guy responsible for OMIM and the PI to my old PI (my grand-PI?), once said "Genes are like rivers - no one can really point to exactly where they start or end, and the middle bit is always changing, but we all agree that they should be named after the people who find them... unless, of course, a more popular name comes along - usually one that describes what happens when the river disappears." ... "If you ever get a chance to name a gene, best to just name it after its sequence at the time you found it."
The guy had 0 knowledge of anything computery, but he's totally right - naming genes is dumb to begin with. If you really must, just use 1 naming schema and define the names based on reference position. Anything beyond that becomes really really messy really quickly.
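If you are stuck with symbols anyway, one way to stay on a single naming schema is to map everything to one ID and refuse to guess when an alias points to more than one gene. Here is a rough sketch of that idea, not a tested pipeline. It assumes the NCBI gene_info layout (tab-separated, GeneID in column 2, Symbol in column 3, pipe-separated Synonyms in column 5 with "-" meaning none). The function names, the file path argument, and the toy records are all made up for illustration, and the matching is exact and case-sensitive, which may be too strict for some sources.

```python
import csv
from collections import defaultdict


def read_gene_info(path):
    """Yield (gene_id, symbol, synonyms) from an NCBI gene_info file.

    Assumes the standard tab-separated layout with a header line
    starting with '#'. The path is supplied by the caller.
    """
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row or row[0].startswith("#"):
                continue  # skip the header and blank lines
            yield row[1], row[2], row[4]


def build_symbol_index(records):
    """Build two lookups: official symbol -> GeneIDs, alias -> GeneIDs."""
    official = defaultdict(set)
    aliases = defaultdict(set)
    for gene_id, symbol, synonyms in records:
        official[symbol].add(gene_id)
        if synonyms != "-":  # '-' marks "no synonyms" in gene_info
            for alias in synonyms.split("|"):
                aliases[alias].add(gene_id)
    return official, aliases


def resolve(symbol, official, aliases):
    """Return (gene_id, how) and never guess between several genes."""
    key = symbol.strip()
    hits = official.get(key, set())
    if len(hits) == 1:
        return next(iter(hits)), "official"  # official symbols win
    candidates = aliases.get(key, set())
    if len(candidates) == 1:
        return next(iter(candidates)), "alias"
    if hits or candidates:
        return None, "ambiguous"  # flag for manual curation
    return None, "unknown"


# Toy records with invented IDs and symbols, only to show the behaviour.
toy_records = [
    ("1", "GENEA", "ALIASX|A1"),
    ("2", "GENEB", "ALIASX"),
    ("3", "GENEC", "-"),
]
official, aliases = build_symbol_index(toy_records)

for query in ["GENEA", "A1", "ALIASX", "NOPE"]:
    print(query, resolve(query, official, aliases))
```

Checking official symbols before aliases means a current symbol is never overridden by some other gene's old alias. Anything that comes back as "ambiguous" or "unknown" is best resolved by hand, using whatever extra information the source provides (chromosome, description, and so on).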