Hi! I am currently working with human proteins and I need to map between RefSeq, Ensembl, and UniProt, because my sources of "location on protein" span multiple databases (ClinVar, UniProt, etc.). Since Ensembl provides the great BioMart, which helps match a lot of identifiers, I am currently using the Ensembl protein ID (ENSP) as the "standard reference" for protein sequences, and I assume that all identifiers mapped to the same ENSP ID refer to the same protein sequence. Please note that I am using every transcript/protein that has a record in my evidence source databases, so they are not necessarily the canonical ones.
However, I soon realized that the id-sequence mapping consistency (let's call it that for now...) might not always be guaranteed. Taking mRNA as an example, it looks like only the MANE Select transcripts are verified to be identical, even though BioMart can successfully map some other ENSTs to RefSeq transcript IDs (NM). I'm not sure if protein sequences also have such variations, though I assume it would be less severe since we have a deterministic (?) MET start and TER end. The UniProt mapping adds yet another difficulty, since it is using PDB sequences where possible. Is this issue valid, and if so, is it even solvable? I would appreciate any suggestions/advice!
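A minimal sketch of how such consistency could be checked directly, assuming the sequences for every identifier have already been downloaded (for example as FASTA) and loaded into a dictionary. The identifiers and sequences below are placeholders, not real records, and `normalize` and `check_mappings` are hypothetical helper names used only for illustration.

```python
def normalize(seq):
    """Uppercase the sequence and drop whitespace and a trailing stop symbol (*)."""
    return seq.strip().upper().rstrip("*")


def check_mappings(pairs, sequences):
    """Label each (ensp_id, other_id) pair as 'identical', 'different' or 'missing'.

    pairs: iterable of (ensp_id, other_id) tuples, e.g. exported from BioMart.
    sequences: dict mapping any identifier to its protein sequence.
    """
    report = {}
    for ensp_id, other_id in pairs:
        ensp_seq = sequences.get(ensp_id)
        other_seq = sequences.get(other_id)
        if ensp_seq is None or other_seq is None:
            report[(ensp_id, other_id)] = "missing"
        elif normalize(ensp_seq) == normalize(other_seq):
            report[(ensp_id, other_id)] = "identical"
        else:
            report[(ensp_id, other_id)] = "different"
    return report


# Placeholder data (assumption): made-up IDs and sequences, not real entries.
sequences = {
    "ENSP_A": "MKTAYIAKQR",
    "NP_A.1": "MKTAYIAKQR*",   # same protein, written with a stop symbol
    "NP_B.1": "MKTAYIAKQRQ",   # one extra residue at the C-terminus
}
pairs = [("ENSP_A", "NP_A.1"), ("ENSP_A", "NP_B.1"), ("ENSP_A", "NP_MISSING.1")]

report = check_mappings(pairs, sequences)
for pair, status in report.items():
    print(pair, status)
```

Only pairs labelled "identical" would then be safe to treat as the same sequence when transferring residue positions; the others would need an alignment or would have to be dropped.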
Consider using the official HGNC list of human gene symbols (this link will open the file in the browser). This file includes mappings to various databases.
This will miss every protein for which there is no assigned HGNC symbol, which may include proteins that are in Ensembl and UniProt but not in RefSeq.
Sure, but it would help unambiguously with the large percentage of proteins that are well known.
Are you aware of how many there are that don't fall into this category? I see that the HGNC file currently has 43k+ entries.
Thanks for sharing the resource! From a brief look, HGNC does not seem to provide all transcripts/proteins of each gene. Since my task is protein-centric (my core evidence comes from ClinVar, which is essentially protein-centric), I do need the different transcripts/peptides from the same gene, which is what raises this issue of aligning IDs and sequences concurrently.
As a demo example,
HGNC:20603 DHDDS
contains one entry only, but in Ensembl (ENSP) and RefSeq there are many more.
I have moved my post to a comment in light of your needs. Hopefully it will be of help to someone who is looking for a single mapping that is already validated by HGNC.