I'm trying to make sense of some data and analysis I've inherited. It's a big mess. The results include some gene names that are clearly corrupted ("Excel genes", etc.).
So, for starters, I want to determine a "clean" list of the "gene universe" that was used for the analysis. All I have been able to ascertain is that the reference gene data came from UCSC's hg19 build.
By "clean" I mean that the list I'm looking for should:
- be complete
- contain no synonyms
- consist of identifiers belonging to one and only one system of nomenclature
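To make these criteria concrete, here is a rough Python sketch of the checks I would run on any candidate list. It only covers what can be tested without a reference: duplicates, Excel-mangled names, and a mix of identifier systems. Completeness and synonyms cannot be checked this way, since both need an authoritative table to compare against. The function names (`classify`, `audit`) are hypothetical, and the regular expressions are my own assumed approximations of the identifier formats, not official definitions.

```python
import re
from collections import Counter

# Excel turns symbols like SEPT1 or MARCH1 into dates ("1-Sep", "1-Mar")
# and some numeric-looking IDs into scientific notation ("2.31E+13").
EXCEL_MANGLED = re.compile(
    r"^(\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
    r"|\d+(\.\d+)?E\+\d+)$",
    re.IGNORECASE,
)

# Assumed, approximate patterns for identifier systems other than gene symbols.
ID_SYSTEMS = {
    "ensembl": re.compile(r"^ENS[GT]\d+"),
    "refseq": re.compile(r"^[NX][MR]_\d+"),
    "ucsc": re.compile(r"^uc\d{3}[a-z]{3}\.\d+$"),
}


def classify(name):
    """Guess the identifier system; anything unmatched is treated as a symbol."""
    for system, pattern in ID_SYSTEMS.items():
        if pattern.match(name):
            return system
    return "symbol"


def audit(names):
    """Report duplicates, Excel-mangled entries, and identifier systems seen."""
    names = [n.strip() for n in names if n.strip()]  # drop blank lines
    counts = Counter(names)
    return {
        "duplicates": sorted(n for n, c in counts.items() if c > 1),
        "excel_mangled": sorted(n for n in counts if EXCEL_MANGLED.match(n)),
        # Counted over distinct names, so duplicates do not inflate the tally.
        "systems": dict(Counter(classify(n) for n in counts)),
    }
```

With a file of one name per line, this could be called as `audit(open("genes.txt"))`, where `genes.txt` is a placeholder filename. A clean list would come back with empty `duplicates` and `excel_mangled` entries and a single key in `systems`.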
Of course, the ideal would be a simple file published by UCSC, with one gene name per line and no duplicates, but after much searching I have not been able to find one.
Does such a file exist? If not, what would be the closest thing I could find?
Thanks in advance!