Question

HGNC cross-references in UniProt

8.2 years ago

cdsouthan ★ 1.9k

There are 19035 protein-coding rows in the HGNC download, but the UniProt column for those 19035 rows collapses to 18883 unique accessions, implying 152 one-to-many Swiss-Prot > HGNC mappings.

However, when I query UniProt with database:(type:hgnc) AND reviewed:yes AND organism:"Homo sapiens (Human) [9606]", I get 19960 of the 20,168 entries, implying 905 for the same 1:many relationship, yet I can only find 152 duplicates in the column.

Can anyone who has been doing something similar help out here? (Note that this falls between two help desks.)

HGNC human proteins uniprot • 2.3k views

ADD COMMENT • link 8.2 years ago by cdsouthan ★ 1.9k

After some hours of head scratching, cross-checking and making Venn intersects (see Twitter), I think I have an explanation. So no one needs to dive into this if they have better things to do, but I will hold off on my conclusions for a while to see if anyone comes up with an independent, corroborating explanation (which I think is important for the domain of protein annotation).

ADD REPLY • link 8.2 years ago by cdsouthan ★ 1.9k

Thanks for all the comments. I managed the review in the end: "Last rolls of the yoyo: Assessing the human canonical protein count [version 1; referees: awaiting peer review]" https://f1000research.com/articles/6-448/v1 Feedback welcome.

ADD REPLY • link 8.1 years ago by cdsouthan ★ 1.9k

score 1 · Answer 1 · 2017-02-20

In UniProt release 2017_02 there are 171 UniProtKB/Swiss-Prot entries with more than one HGNC link, while 52 HGNC links point to more than one UniProtKB/Swiss-Prot entry.

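On the HGNC side there is unfortunately no SPARQL endpoint, so there is no nice way to do this kind of analytics there. On the UniProt side, the following query lists the reviewed entries with more than one HGNC link.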

PREFIX up:<http://purl.uniprot.org/core/> 
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> 
SELECT 
    ?protein 
    (GROUP_CONCAT(SUBSTR(STR(?db),30);separator=',') AS ?hgncs)
WHERE
{
   ?protein a up:Protein .
   ?protein up:reviewed true .
   ?protein rdfs:seeAlso ?db .
   ?db up:database <http://purl.uniprot.org/database/HGNC>
} GROUP BY ?protein HAVING (COUNT(DISTINCT(?db)) >1)

The inverse query asks for HGNC links present in more than one UniProtKB/Swiss-Prot entry.

PREFIX up:<http://purl.uniprot.org/core/> 
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> 
SELECT
    ?db
    (GROUP_CONCAT(SUBSTR(STR(?protein),33);separator=',') AS ?proteins)
WHERE
{
  ?protein a up:Protein .
  ?protein up:reviewed true .
  ?protein rdfs:seeAlso ?db .
  ?db up:database <http://purl.uniprot.org/database/HGNC>
} GROUP BY ?db HAVING (COUNT(DISTINCT(?protein)) >1)
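
Since the HGNC side has no SPARQL endpoint, the same two counts could be approximated offline from an HGNC download. The following is only a minimal sketch under stated assumptions: it assumes a tab-separated file with a header row, an HGNC identifier column named `hgnc_id`, and a UniProt accession column named `uniprot_ids` whose multiple values are separated by `|`. The function name `one_to_many` and the toy identifiers are hypothetical and chosen for illustration only.

```python
import csv
import io
from collections import defaultdict


def one_to_many(tsv_text, hgnc_col="hgnc_id", uniprot_col="uniprot_ids", sep="|"):
    """Find one-to-many mappings in both directions.

    Returns two dicts:
      - accession -> sorted HGNC IDs, for accessions linked to more than one HGNC ID
      - HGNC ID -> sorted accessions, for HGNC IDs linked to more than one accession
    Column names and the separator are assumptions about the download format.
    """
    by_accession = defaultdict(set)
    by_hgnc = defaultdict(set)

    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        hgnc_id = (row.get(hgnc_col) or "").strip()
        if not hgnc_id:
            continue  # skip rows without an HGNC identifier
        raw = row.get(uniprot_col) or ""
        for acc in (a.strip() for a in raw.split(sep)):
            if acc:  # ignore empty cells and stray separators
                by_accession[acc].add(hgnc_id)
                by_hgnc[hgnc_id].add(acc)

    multi_hgnc = {acc: sorted(ids) for acc, ids in by_accession.items() if len(ids) > 1}
    multi_acc = {hid: sorted(accs) for hid, accs in by_hgnc.items() if len(accs) > 1}
    return multi_hgnc, multi_acc


# Toy input with made-up identifiers (not real HGNC or UniProt records).
toy = (
    "hgnc_id\tuniprot_ids\n"
    "HGNC:1\tACC_A\n"
    "HGNC:2\tACC_A\n"
    "HGNC:3\tACC_B|ACC_C\n"
    "HGNC:4\t\n"
)
shared_accessions, multi_accession_genes = one_to_many(toy)
print(len(shared_accessions), len(multi_accession_genes))
```

On a real download, `shared_accessions` would correspond to the first query (one Swiss-Prot entry, several HGNC links) and `multi_accession_genes` to the inverse query, although the counts need not match UniProt's because the two resources curate their cross-references independently.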