6.0 years ago
cdsouthan
As we know, Swiss-Prot Recommended protein names and HGNC Approved gene names can differ for the same protein. The question is: exactly how many differ?
On the admittedly crude basis of exact text-string matches between download lists from both sides, the overlap is only 91! However, this is confounded by minor differences in initial-letter case, Greek symbols, spacing, and hyphen usage. Can anyone cross-check this in a more sophisticated fashion? (N.B. this has been tweeted with a Venny picture, but there is more room for technical replies here, I guess.)
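One way to take the trivial differences out of the comparison is to normalise both lists before intersecting them. The following is only a minimal sketch: the `GREEK` table is an illustrative subset, and `normalize` and `overlap_counts` are hypothetical helper names, not part of any Swiss-Prot or HGNC tooling. It assumes each download has already been read into a plain list of name strings.

```python
import re
import unicodedata

# Illustrative subset of Greek letters (an assumption); extend as needed.
GREEK = {
    "α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta",
    "ε": "epsilon", "ζ": "zeta", "θ": "theta", "κ": "kappa",
    "λ": "lambda", "μ": "mu", "σ": "sigma", "ω": "omega",
}

def normalize(name: str) -> str:
    """Fold case, spell out Greek symbols, and unify hyphens and spacing."""
    text = unicodedata.normalize("NFKC", name).casefold()
    for symbol, word in GREEK.items():
        text = text.replace(symbol, f" {word} ")
    # Treat runs of hyphens, dashes and whitespace as a single space.
    text = re.sub(r"[-\u2010\u2013\s]+", " ", text)
    return text.strip()

def overlap_counts(names_a, names_b):
    """Count shared and unique names after normalisation."""
    a = {normalize(n) for n in names_a}
    b = {normalize(n) for n in names_b}
    return {"shared": len(a & b), "only_a": len(a - b), "only_b": len(b - a)}

if __name__ == "__main__":
    swissprot = ["Gasdermin-A", "Glutathione S-transferase A1"]
    hgnc = ["gasdermin-A", "glutathione S-transferase alpha 1"]
    print(overlap_counts(swissprot, hgnc))
```

This only removes differences in case, Greek symbols, spacing, and hyphenation. Pairs such as "A1" vs "alpha 1" still count as different, so the result is an upper bound on the real disagreement rather than a measure of it.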
I am wondering why you care. Also, the fact that you mention differences in spacing and hyphenation makes me think you may be looking at the wrong field(s). HGNC symbols should have neither spaces nor hyphens (with a few exceptions, see here for details). In Swiss-Prot, the gene field should be the HGNC symbol of the gene(s) they think produce the protein. However, this could be out of sync with HGNC itself, depending on when each entry was last updated. If you need proteins integrated with a supporting genome in a consistent way, I suggest using Ensembl.
Apologies if I miscommunicated. Names are particularly important for the Guide to Pharmacology because we have to grapple with and curate three different sets of target names: from Swiss-Prot, HGNC, and the IUPHAR Nomenclature Committee (NC-IUPHAR). We use symbols, of course, and collaborate with HGNC but, like everyone else, we come across many cases where the Approved HGNC name and the Swiss-Prot Recommended protein name (both assigned by independent curators) are generally similar but not an exact match, e.g.:
Epidermal growth factor-like protein 7 vs EGF like domain multiple 7
Translation initiation factor eIF-2B subunit gamma vs eukaryotic translation initiation factor 2B subunit gamma
Gasdermin-A vs gasdermin-A
Mitochondrial enolase superfamily member 1 vs enolase superfamily member 1
Glutathione S-transferase A1 vs glutathione S-transferase alpha 1
It was the exact stats of intersects and diffs I was after (even though the name totals are 19,198 on one side vs 20,410 on the other).
I thought you could get at this via the gene symbols: since Swiss-Prot deals with proteins and HGNC with genes, you won't get protein names from HGNC. I don't know about the stats, and getting them from the names themselves might be tricky, as one would need to decide what counts as an acceptable difference when not doing exact text matching (in particular when multiple protein isoforms with different names are produced by the same gene). If the goal is to know whether different resources refer to the same gene despite differing names, I would map each entity to an annotated reference genome, and all names mapping to the same reference gene would be considered synonyms. Using Ensembl for this should be relatively easy, since they have already mapped UniProt entries and use HGNC symbols. I don't know about IUPHAR, but I would expect them to have sequence identifiers associated with the entities they name.
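As a rough illustration of the symbol-based route, the two name lists can be joined on gene symbol first and the paired names compared afterwards. This is a sketch under assumptions: each resource is taken to be already loaded as a dict of symbol to name, `classify_pairs` is a hypothetical helper, and the gene symbols in the example were filled in by hand for illustration and should be checked against the actual downloads. The similarity score comes from Python's `difflib`, which is just one possible way to define an "acceptable difference".

```python
import re
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Fold case and treat hyphens and runs of whitespace as one space."""
    return re.sub(r"[-\s]+", " ", name.casefold()).strip()

def classify_pairs(swissprot: dict, hgnc: dict) -> dict:
    """For symbols present in both resources, label how the two names relate.

    Returns symbol -> (category, similarity), where category is one of
    "exact", "normalized" or "different".
    """
    results = {}
    for symbol in swissprot.keys() & hgnc.keys():
        a, b = swissprot[symbol], hgnc[symbol]
        if a == b:
            category = "exact"
        elif normalize(a) == normalize(b):
            category = "normalized"
        else:
            category = "different"
        similarity = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        results[symbol] = (category, round(similarity, 2))
    return results

if __name__ == "__main__":
    # Symbols below are assumptions added for illustration.
    swissprot = {
        "GSDMA": "Gasdermin-A",
        "GSTA1": "Glutathione S-transferase A1",
        "EGFL7": "Epidermal growth factor-like protein 7",
    }
    hgnc = {
        "GSDMA": "gasdermin-A",
        "GSTA1": "glutathione S-transferase alpha 1",
        "EGFL7": "EGF like domain multiple 7",
    }
    for symbol, (category, similarity) in sorted(classify_pairs(swissprot, hgnc).items()):
        print(symbol, category, similarity)
```

Counting the categories gives the intersect and diff stats per gene rather than per string, and the similarity column lets the "different" pairs be ranked for manual review. A gene with several differently named protein products would need a one-to-many structure instead of a plain dict.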