There are 19035 protein-coding rows in the HGNC download, but the 19035 entries in the UniProt column collapse to 18883 distinct values, implying 432 one-to-many Swiss-Prot > HGNC.
However, when I query UniProt with database:(type:hgnc) AND reviewed:yes AND organism:"Homo sapiens (Human) [9606]" I get 19960 from the 20,168, implying 905 for the same 1:many, but I can only find 152 duplicates in the column.
Can anyone who's been doing something similar help out here? (Note that it falls between two help desks.)
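The column check described in the question can be sketched roughly as follows. This is only an illustration under assumptions: the function name `column_duplicate_stats` and the toy accessions are made up, and it assumes one accession per cell, with empty cells skipped.

```python
from collections import Counter

def column_duplicate_stats(values):
    """Summarise repeats in one column of accessions.

    Returns (filled_rows, distinct_values, surplus_rows, repeated_values):
    surplus_rows is filled_rows minus distinct_values, and
    repeated_values is how many accessions occur more than once.
    """
    # Drop empty cells and surrounding whitespace before counting.
    filled = [v.strip() for v in values if v and v.strip()]
    counts = Counter(filled)
    distinct = len(counts)
    surplus = len(filled) - distinct
    repeated = sum(1 for n in counts.values() if n > 1)
    return len(filled), distinct, surplus, repeated

# Toy column (hypothetical accessions, not real data).
toy_column = ["P1", "P2", "P2", "P3", "P3", "P3", "", "P4"]
print(column_duplicate_stats(toy_column))
```

Keeping surplus rows apart from repeated accessions matters because the two only agree when every repeat occurs exactly twice. If every one of the 19035 rows were filled, collapsing to 18883 distinct values would be a surplus of 152 rows.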
After some hours of head scratching, cross-checking and making Venn intersects (see Twitter), I think I have an explanation. So no one needs to dive into this if they have better things to do, but I will hold off on my conclusions for a while to see if anyone wants to come up with an independently corroborating explanation (which I think is important for the domain of protein annotation).
Thanks for all the comments, I managed the review in the end: "Last rolls of the yoyo: Assessing the human canonical protein count [version 1; referees: awaiting peer review]" https://f1000research.com/articles/6-448/v1 Feedback welcome.
In UniProt release 2017_02 there are 171 UniProtKB/Swiss-Prot entries with more than one HGNC link, while 52 HGNC links point to more than one UniProtKB/Swiss-Prot entry.
On the HGNC side, unfortunately, there is no SPARQL endpoint, so there is no nice way to do this kind of analytics there. On the UniProt side, this query lists the reviewed entries with more than one HGNC link:
    PREFIX up:<http://purl.uniprot.org/core/>
    PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
    SELECT
      ?protein
      (GROUP_CONCAT(SUBSTR(STR(?db),30);separator=',') AS ?hgncs)
    WHERE
    {
      ?protein a up:Protein .
      ?protein up:reviewed true .
      ?protein rdfs:seeAlso ?db .
      ?db up:database <http://purl.uniprot.org/database/HGNC>
    }
    GROUP BY ?protein
    HAVING (COUNT(DISTINCT(?db))>1)
The inverse query asks for HGNC links present in more than one UniProtKB/Swiss-Prot entry:
    PREFIX up:<http://purl.uniprot.org/core/>
    PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
    SELECT
      ?db
      (GROUP_CONCAT(SUBSTR(STR(?protein),33);separator=',') AS ?proteins)
    WHERE
    {
      ?protein a up:Protein .
      ?protein up:reviewed true .
      ?protein rdfs:seeAlso ?db .
      ?db up:database <http://purl.uniprot.org/database/HGNC>
    }
    GROUP BY ?db
    HAVING (COUNT(DISTINCT(?protein))>1)
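For anyone working from a downloaded mapping file rather than the endpoint, the grouping that the two queries perform can be approximated locally. This is a minimal sketch, assuming the data has already been reduced to (Swiss-Prot accession, HGNC ID) pairs; the function name `one_to_many` and the toy identifiers are hypothetical.

```python
from collections import defaultdict

def one_to_many(pairs):
    """Group (protein, hgnc) pairs in both directions.

    Returns two dicts: proteins linked to more than one HGNC ID,
    and HGNC IDs linked to more than one protein.
    """
    by_protein = defaultdict(set)
    by_hgnc = defaultdict(set)
    for protein, hgnc in pairs:
        by_protein[protein].add(hgnc)
        by_hgnc[hgnc].add(protein)
    # Sets make repeated pairs harmless, like COUNT(DISTINCT ...) in the queries.
    multi_protein = {p: sorted(h) for p, h in by_protein.items() if len(h) > 1}
    multi_hgnc = {h: sorted(p) for h, p in by_hgnc.items() if len(p) > 1}
    return multi_protein, multi_hgnc

# Toy pairs (placeholder identifiers, not real records).
toy_pairs = [
    ("PROT_A", "HGNC:1"),
    ("PROT_A", "HGNC:2"),
    ("PROT_B", "HGNC:3"),
    ("PROT_C", "HGNC:3"),
    ("PROT_D", "HGNC:4"),
]
multi_protein, multi_hgnc = one_to_many(toy_pairs)
print(len(multi_protein), "proteins with more than one HGNC link")
print(len(multi_hgnc), "HGNC links with more than one protein")
```

Counting the two directions separately keeps the two figures (entries with several HGNC links, and HGNC links shared by several entries) from being conflated.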
OK, thanks, but the biological/curation issue behind the numbers above is as follows:
It looks like Swiss-Prot has included a large number of proteins (on the order of 500-800) that HGNC does not classify as protein-coding. The largest categories, I think (by manual inspection of matches from segments of the Venn I put on Twitter), are endogenous retroviruses, long non-coding RNAs and odour receptor pseudogenes. This is numerically dominant over the relatively small one-to-many set (SP <> HGNC in both directions, as Jerv shows), whose members I think both resources agree are proteins.
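The Venn comparison mentioned above boils down to set differences between two lists of HGNC IDs. A rough sketch, assuming one set holds the HGNC IDs cross-referenced from reviewed human Swiss-Prot entries and the other holds the IDs HGNC labels protein-coding; the function name `venn_segments` and the toy IDs are illustrative only.

```python
def venn_segments(swissprot_hgnc_ids, hgnc_protein_coding_ids):
    """Split two collections of HGNC IDs into the three Venn segments."""
    sp = set(swissprot_hgnc_ids)
    pc = set(hgnc_protein_coding_ids)
    return {
        # In Swiss-Prot but not classified as protein-coding by HGNC.
        "swissprot_only": sp - pc,
        # Agreed on by both resources.
        "both": sp & pc,
        # Protein-coding in HGNC but without a Swiss-Prot link.
        "hgnc_only": pc - sp,
    }

# Toy ID sets (placeholders, not real data).
segments = venn_segments(
    ["HGNC:1", "HGNC:2", "HGNC:3", "HGNC:9"],
    ["HGNC:2", "HGNC:3", "HGNC:4"],
)
for name, ids in segments.items():
    print(name, len(ids), sorted(ids))
```

The `swissprot_only` segment is the one to inspect by hand for categories such as those listed above.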