I am struck by the, let's say, lexical heterogeneity of the entries in the geneSymbol column of UCSC hg19's kgXref table. Here's a sample [1]:
T
AR
C3
TRA@
HGC6.3
Z70701
unknown
Ig kappa
TIMELESS
5_8S_rRNA
OK/SW-cl.16
cytochrome b
Em:AC005003.4
Ig alpha 1-[alpha]2m
DTX2P1-UPK3BP1-PMS2P11
aromatase cytochrome P-450 (P-450AROM)
immunoglobulin epsilon chain constant...
T-cell receptor alpha chain variable ...
I would like to know more about the "semantics" of this table's geneSymbol column, but I am having a really hard time finding authoritative [2] answers to my questions. (These questions include, among others, the following. What is the provenance of these "gene symbols"? Is UCSC the ultimate authority on them, or are they getting these symbols from some other authority? Who/what ensures that distinct symbols always refer to distinct genes? etc.)
If I go to http://genome.ucsc.edu/cgi-bin/hgTables?db=hg19, select kgXref from the "table" dropdown, and then click on "describe table schema", the resulting page shows a lot of useful information, but it does not tell me anything about how this table was put together. In particular, it tells me nothing relevant to questions like the ones mentioned earlier.
[1] The "..." at the end of the last two entries in the list belongs, in fact, to the values stored in the table. The length of both of these geneSymbol entries is 40; they are the longest ones in the table. FWIW, the type of the geneSymbol column is varchar(255).
[2] By "authoritative answers" I mean answers that come from a publication (preferably peer-reviewed) authored by those who produced the database. It is not too difficult to come by educated guesses to answer at least some of these questions. I probably could do a passable job myself, but this is not what I am after.
Thanks. Do you by any chance know if the UCSC database has some identifier/column that uniquely identifies human genes (in a strict 1-to-1 correspondence between genes and these identifiers)? The so-called "known gene ID" (aka kgID) cannot be it, because there are 82,960 distinct kgIDs in kgXref, which seems to me just too high. (Here again, it sure would be nice to have some authoritative documentation on the semantics of the kgID column.) In contrast, kgXref mentions only 28,514 distinct "geneSymbols", a number that seems to me more in line with the commonly cited estimates of the number of genes in the human genome. I sure hope, however, that the UCSC genome database has a more carefully controlled set of identifiers than these chaotic "geneSymbols" to uniquely identify what is probably the most important entity in their database.
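For anyone who wants to check counts like these, here is a minimal sketch in Python. It assumes a local tab-separated dump of kgXref with no header row, and it assumes kgID is the first column and geneSymbol the fifth; please verify both against the "describe table schema" page before relying on it. The function name count_distinct and the sample rows are made up for illustration and are not real kgXref content.

```python
import csv
import io


def count_distinct(tsv_text, kgid_col=0, symbol_col=4):
    """Count distinct kgIDs and geneSymbols in tab-separated kgXref-like text.

    The column positions are assumptions (0-based); adjust them to the schema.
    Returns (number of kgIDs, number of symbols, {symbol: set of kgIDs}).
    """
    kg_ids = set()
    by_symbol = {}
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        kg_id, symbol = row[kgid_col], row[symbol_col]
        kg_ids.add(kg_id)
        by_symbol.setdefault(symbol, set()).add(kg_id)
    return len(kg_ids), len(by_symbol), by_symbol


# Made-up rows for illustration only; not real kgXref content.
SAMPLE = (
    "id1\tm1\t\t\tSYM_A\n"
    "id2\tm2\t\t\tSYM_A\n"
    "id3\tm3\t\t\tSYM_B\n"
)

n_ids, n_symbols, by_symbol = count_distinct(SAMPLE)
print(n_ids, n_symbols)
```

On a real dump, you would pass the file's contents instead of SAMPLE. The by_symbol mapping also shows how many kgIDs share each symbol, which is one way to see why the two counts differ so much.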
Many of us avoid UCSC since their annotations have historically been... problematic. You'll likely be better off with Ensembl/Gencode. Ensembl IDs should prove to be a superset of what's in HGNC.