I currently have a large list of Ensembl protein IDs (ENSP) from GRCh37. I need to map these IDs to the entry name listed on the UniProt website (e.g. 'CASPE_HUMAN'). I am having trouble doing this with the UniProt dataset, since it is kept up to date with the GRCh38 Ensembl IDs. Right now, I have a dataset that maps GRCh37 IDs to UniProtKB-AC (e.g. P31944), but some of these UniProt accessions are obsolete. Is there a way to see which Ensembl IDs were updated in the GRCh38 version? My overall goal is to find the current UniProt IDs for my list of GRCh37 IDs.
I would love to have a dataframe that looks like (currently using Python):
GRCh37_ID  GRCh38_ID  Old UniProt  New UniProt
ENSP001    ENSP001    P1234        P1234
ENSP002    ENSP004    P4567        P5632
ENSP003    ENSP009    P1292        P1292
ENSP004    ENSP0012   P1434        P2434
After this, I could just grab the new Uniprot ID that corresponds to my old GRCh37_IDs to find the entry name. Is this possible? I've been struggling to figure this out.
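For what it's worth, once the two lookups exist (old ENSP to new ENSP, and old accession to current accession), the table itself is just two left merges in pandas. This is only a sketch: the column names and the placeholder IDs below are made up for illustration and do not come from any real mapping file.

```python
import pandas as pd

# Hypothetical inputs: all IDs here are placeholders, not real identifiers.
old_map = pd.DataFrame({
    "GRCh37_ID": ["ENSP001", "ENSP002"],
    "Old_UniProt": ["P1234", "P4567"],
})
# Assumed output of an Ensembl ID conversion step (old ENSP -> current ENSP).
id_history = pd.DataFrame({
    "GRCh37_ID": ["ENSP001", "ENSP002"],
    "GRCh38_ID": ["ENSP001", "ENSP004"],
})
# Assumed output of a UniProt accession update step (old -> current accession).
acc_update = pd.DataFrame({
    "Old_UniProt": ["P1234", "P4567"],
    "New_UniProt": ["P1234", "P5632"],
})

# Left merges keep every GRCh37 ID, even when a lookup has no match (NaN).
table = (
    old_map
    .merge(id_history, on="GRCh37_ID", how="left")
    .merge(acc_update, on="Old_UniProt", how="left")
)
# Flag rows whose accession differs between the old and current mapping.
table["changed"] = table["Old_UniProt"] != table["New_UniProt"]
table = table[["GRCh37_ID", "GRCh38_ID", "Old_UniProt", "New_UniProt", "changed"]]
```

Rows with no match in either lookup come through with missing values rather than being dropped, which makes it easy to see which IDs still need manual attention.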
If I understand correctly, you have lists of Ensembl Translation/Protein stable IDs (ENSPs) for both GRCh37 and GRCh38, and for each list you would like to independently find the corresponding UniProt IDs? I think one of the easiest ways to map Ensembl IDs to external reference IDs is the BioMart tool from Ensembl, which can also be used online (there are versions of the tool for GRCh37 and for GRCh38). Under Attributes, select the type of external ID you want to map to; in your case I guess this would be UniProtKB-SwissProtID/TrEMBLID. Under Filters, select the type of ID you are providing, in your case Protein stable IDs, and upload the file containing the stable IDs.
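The same BioMart query can be scripted instead of clicked through. Below is a rough sketch using only the Python standard library. The service URL, the dataset name `hsapiens_gene_ensembl`, and the filter/attribute names `ensembl_peptide_id` and `uniprotswissprot` are assumptions that should be checked against the attribute list shown in the GRCh37 BioMart web interface; the helper names are made up for this example.

```python
import csv
import io
import urllib.parse
import urllib.request

# Assumed endpoint of the GRCh37 BioMart service.
GRCH37_MART = "https://grch37.ensembl.org/biomart/martservice"

def build_query(ensp_ids):
    """Return BioMart XML asking for the Swiss-Prot ID of each protein stable ID."""
    ids = ",".join(ensp_ids)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<!DOCTYPE Query>"
        '<Query virtualSchemaName="default" formatter="TSV" header="0" uniqueRows="1">'
        '<Dataset name="hsapiens_gene_ensembl" interface="default">'
        f'<Filter name="ensembl_peptide_id" value="{ids}"/>'
        '<Attribute name="ensembl_peptide_id"/>'
        '<Attribute name="uniprotswissprot"/>'
        "</Dataset></Query>"
    )

def parse_tsv(text):
    """Turn the tab-separated response into (ENSP, UniProt) pairs, skipping blanks."""
    rows = csv.reader(io.StringIO(text), delimiter="\t")
    return [(r[0], r[1]) for r in rows if len(r) >= 2 and r[1]]

def fetch_mapping(ensp_ids, url=GRCH37_MART):
    """Send the query (needs network access) and return the parsed pairs."""
    data = urllib.parse.urlencode({"query": build_query(ensp_ids)}).encode()
    with urllib.request.urlopen(url, data=data, timeout=60) as resp:
        return parse_tsv(resp.read().decode())
```

Calling `fetch_mapping(["ENSP00000221740"])` would send the request. For a large list it is probably safer to submit the IDs in batches of a few hundred rather than all at once.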
Thank you for your response. I started with a list of Ensembl Translation/Protein stable IDs (ENSPs) for GRCh37 and I want to find their UniProtKB-SwissProtIDs. The issue I am having is that when I use BioMart, there are some UniProtKB-SwissProtIDs included that are no longer in the UniProt system (so I can't find an entry_name for it). I was thinking in order to combat this, I could find the corresponding ENSPs for GRCh38 and then find their UniProtKB-SwissProtIDs since it should be more up to date. The issue is, I don't know how to map the old ENSPs to the new ones.
You can use the UniProt batch retrieval via this link
https://www.uniprot.org/id-mapping
to find current accession numbers for your obsolete identifiers. Just upload your list and map from UniProtKB to UniProtKB (or to UniProtKB/Swiss-Prot if you only want reviewed entries returned).
In case of doubt, please don't hesitate to contact the UniProt helpdesk.
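The UniProtKB to UniProtKB mapping described above can also be run from Python through the UniProt REST service. The sketch below is a hedged outline, not an official client: the base URL, the `from`/`to` database names, and the JSON field names (`jobId`, `jobStatus`, `results`, `primaryAccession`, `uniProtkbId`) are assumptions based on the public ID mapping API and should be verified against its documentation. It also reads only the first page of results, so long lists would need pagination or batching.

```python
import json
import time
import urllib.parse
import urllib.request

API = "https://rest.uniprot.org/idmapping"  # assumed base URL of the REST service

def _get_json(url, data=None):
    """GET (or POST when data is given) and decode the JSON response."""
    with urllib.request.urlopen(url, data=data, timeout=60) as resp:
        return json.load(resp)

def run_mapping(accessions, poll_seconds=3, max_polls=40):
    """Submit a UniProtKB -> UniProtKB job and poll until results appear (needs network)."""
    form = urllib.parse.urlencode(
        {"from": "UniProtKB_AC-ID", "to": "UniProtKB", "ids": ",".join(accessions)}
    ).encode()
    job_id = _get_json(f"{API}/run", data=form)["jobId"]
    for _ in range(max_polls):
        payload = _get_json(f"{API}/status/{job_id}")
        # Assumption: a finished job redirects to its results, which carry a "results" key.
        if "results" in payload:
            return payload
        if payload.get("jobStatus") not in ("NEW", "RUNNING"):
            raise RuntimeError(f"unexpected job status: {payload}")
        time.sleep(poll_seconds)
    raise TimeoutError("mapping job did not finish in time")

def entry_names(payload):
    """Return (submitted accession, current accession, entry name) triples."""
    triples = []
    for hit in payload.get("results", []):
        target = hit["to"]
        triples.append((hit["from"], target["primaryAccession"], target["uniProtkbId"]))
    return triples
```

`entry_names(run_mapping(["P31944"]))` would give the triples for one accession. A list is returned rather than a dict because one old accession may map to more than one current entry.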
Hi MB - just to add to the other replies here, you could also look into using the Ensembl ID History Converter, which allows you to input a list of Ensembl IDs from a previous Ensembl release and find which IDs they map to in the current release: https://www.ensembl.org/Homo_sapiens/Tools/IDMapper
Thank you for your response. When I try to use the converter to convert 'ENSP00000221740' (old) into the new Ensembl protein ID, it doesn't give me a new ID, which makes me think that it has not changed. But when I search for it on the current site, nothing comes up. I then grabbed the gene name from the archived website: http://apr2022.archive.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000105141;r=19:15049480-15058293
When I search the gene name, I find the new protein ID is actually ENSP00000393417. http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000105141;r=19:15049480-15058293
So my question is, why didn't the converter give me the new ENSP ID? Am I using it wrong? Here is a link for the results I got: https://uswest.ensembl.org/Homo_sapiens/Tools/IDMapper/Results?tl=OX3zo8qpS7d0HHvs-8486700
Hi MB,
This is because ENST00000221740/ENSP00000221740 does not map to any features in the current gene set:
[1] https://www.ensembl.org/Homo_sapiens/Transcript/Idhistory?t=ENST00000221740
You can see the differences between the transcripts annotated for CASP14 in the GRCh37 and GRCh38 assemblies on the following pages:
[2] https://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000105141;r=19:15049480-15058293
[3] https://grch37.ensembl.org/Homo_sapiens/Transcript/ProteinSummary?db=core;g=ENSG00000105141;r=19:15163015-15166900;t=ENST00000221740
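For retired transcripts like this one, where the gene ID still exists, one possible fallback is to go through the gene and list the proteins currently annotated for it, which is what was done by hand above. The sketch below assumes the Ensembl REST `lookup/id` endpoint with `expand=1` and assumes the response nests each transcript's protein under a `Translation` key; the function names are invented for this example. Which current protein is the right replacement for the old ENSP still has to be decided by inspection.

```python
import json
import urllib.request

REST = "https://rest.ensembl.org"  # assumed: REST server for the current release

def current_translations(gene_payload):
    """Return (transcript ID, protein ID) pairs for transcripts that have a translation."""
    pairs = []
    for tx in gene_payload.get("Transcript", []):
        translation = tx.get("Translation")
        if translation:  # non-coding transcripts carry no translation
            pairs.append((tx["id"], translation["id"]))
    return pairs

def lookup_gene(gene_id):
    """Fetch the expanded gene record as JSON (needs network access)."""
    request = urllib.request.Request(
        f"{REST}/lookup/id/{gene_id}?expand=1",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as resp:
        return json.load(resp)
```

`current_translations(lookup_gene("ENSG00000105141"))` would list the transcript/protein pairs currently annotated for the gene discussed here.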