Ensembl ID mapping GRCh37 vs GRCh38
2.3 years ago
MB • 0

I have a large list of Ensembl protein IDs (ENSP) from GRCh37, and I need to map them to the entry names listed on the UniProt website (e.g. 'CASPE_HUMAN'). I am having trouble doing this with the UniProt dataset because it tracks the GRCh38 Ensembl IDs. Right now I have a dataset that maps GRCh37 IDs to UniProtKB-AC (e.g. P31944), but some of these UniProt accessions are obsolete. Is there a way to see which Ensembl IDs were updated in GRCh38? My overall goal is to find the current UniProt IDs for my list of GRCh37 IDs.

I would love to have a dataframe that looks like (currently using Python):

GRCh37_ID   GRCh38_ID   Old UniProt   New UniProt
ENSP001     ENSP001     P1234         P1234
ENSP002     ENSP004     P4567         P5632
ENSP003     ENSP009     P1292         P1292
ENSP004     ENSP0012    P1434         P2434

After this, I could just take the new UniProt ID that corresponds to each of my old GRCh37 IDs and look up its entry name. Is this possible? I've been struggling to figure this out.

Recap: I started with a list of Ensembl translation/protein stable IDs (ENSPs) for GRCh37 and I want to find their UniProtKB/Swiss-Prot IDs. The problem is that BioMart returns some UniProtKB/Swiss-Prot IDs that are no longer in UniProt, so I can't find an entry name for them. To get around this, I thought I could find the corresponding GRCh38 ENSPs and then look up their UniProtKB/Swiss-Prot IDs, since those should be more up to date. However, I don't know how to map the old ENSPs to the new ones.

UniProt python Ensembl • 1.7k views

If I understand correctly, you have lists of Ensembl translation/protein stable IDs (ENSPs) for both GRCh37 and GRCh38, and you would like to find the corresponding UniProt IDs for each list independently? One of the easiest ways to map Ensembl IDs to external reference IDs is Ensembl's BioMart tool, which can also be used online (there are separate versions of the tool for GRCh37 and for GRCh38). Under Attributes, select the type of external ID you want to map to; in your case I guess this would be the UniProtKB/Swiss-Prot ID or the TrEMBL ID. Under Filters, select the type of ID you are providing, in your case protein stable IDs, and upload the file containing the stable IDs.
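
The same Attributes/Filters selection can also be sent to BioMart as an XML query instead of through the web form. Below is a minimal sketch of that, not a definitive recipe. It assumes the GRCh37 mart is served at `https://grch37.ensembl.org/biomart/martservice`, that the dataset is `hsapiens_gene_ensembl`, and that the filter and attribute are named `ensembl_peptide_id` and `uniprotswissprot`; check these names in the BioMart interface of the release you are using, since they can differ between releases. The helper names (`build_query`, `parse_tsv`, `fetch_mapping`) are made up for this example.

```python
import urllib.parse
import urllib.request
from xml.sax.saxutils import quoteattr

# Assumption: GRCh37 BioMart endpoint. For GRCh38, swap in the main Ensembl site.
GRCH37_MART = "https://grch37.ensembl.org/biomart/martservice"


def build_query(ensp_ids, attributes=("ensembl_peptide_id", "uniprotswissprot")):
    """Build a BioMart XML query that filters on protein stable IDs."""
    attrs = "".join(f"<Attribute name={quoteattr(a)} />" for a in attributes)
    ids = quoteattr(",".join(ensp_ids))
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<!DOCTYPE Query>"
        '<Query virtualSchemaName="default" formatter="TSV" header="0" uniqueRows="1">'
        '<Dataset name="hsapiens_gene_ensembl" interface="default">'
        f'<Filter name="ensembl_peptide_id" value={ids} />'
        f"{attrs}"
        "</Dataset></Query>"
    )


def parse_tsv(text):
    """Turn two-column TSV output into {ENSP: [UniProt accessions]}."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ensp = fields[0]
        accession = fields[1] if len(fields) > 1 else ""
        mapping.setdefault(ensp, [])
        # An ENSP without a Swiss-Prot hit keeps an empty list.
        if accession and accession not in mapping[ensp]:
            mapping[ensp].append(accession)
    return mapping


def fetch_mapping(ensp_ids, url=GRCH37_MART):
    """Send the query (needs network access) and parse the reply."""
    data = urllib.parse.urlencode({"query": build_query(ensp_ids)}).encode()
    with urllib.request.urlopen(url, data=data, timeout=60) as resp:
        return parse_tsv(resp.read().decode())
```

Calling `fetch_mapping(["ENSP00000221740"])` would return a dictionary keyed by ENSP. For very long ID lists it is probably safer to send the IDs in batches.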


Thank you for your response. I started with a list of Ensembl translation/protein stable IDs (ENSPs) for GRCh37 and I want to find their UniProtKB/Swiss-Prot IDs. The problem is that BioMart returns some UniProtKB/Swiss-Prot IDs that are no longer in UniProt, so I can't find an entry name for them. To get around this, I thought I could find the corresponding GRCh38 ENSPs and then look up their UniProtKB/Swiss-Prot IDs, since those should be more up to date. However, I don't know how to map the old ENSPs to the new ones.


You can use the UniProt batch retrieval via this link

https://www.uniprot.org/id-mapping

to find current accession numbers for your obsolete identifiers. Just upload your list and map from UniProtKB to UniProtKB (or to UniProtKB/Swiss-Prot if you only want reviewed entries returned).

In case of doubt, please don't hesitate to contact the UniProt helpdesk.
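
For a long list, the same UniProtKB to UniProtKB mapping can be scripted against the UniProt REST service rather than uploaded by hand. The following is a rough sketch under a few assumptions: that jobs are submitted to `https://rest.uniprot.org/idmapping/run` and polled at `/idmapping/status/{jobId}`, that the source database is named `UniProtKB_AC-ID`, and that each result carries `primaryAccession` and `uniProtkbId` fields. It only reads the first page of results, so pagination is left out. The function names are invented for this example.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumption: base URL of the UniProt ID mapping REST service.
API = "https://rest.uniprot.org/idmapping"


def submit_job(accessions, source="UniProtKB_AC-ID", target="UniProtKB"):
    """Submit accessions for mapping and return the job ID (needs network)."""
    form = urllib.parse.urlencode(
        {"from": source, "to": target, "ids": ",".join(accessions)}
    ).encode()
    with urllib.request.urlopen(f"{API}/run", data=form, timeout=60) as resp:
        return json.load(resp)["jobId"]


def wait_for_first_page(job_id, poll_seconds=3, max_polls=40):
    """Poll the job; a finished job redirects to its first page of results."""
    for _ in range(max_polls):
        with urllib.request.urlopen(f"{API}/status/{job_id}", timeout=60) as resp:
            payload = json.load(resp)
        if "results" in payload or "failedIds" in payload:
            return payload
        if payload.get("jobStatus") not in ("NEW", "RUNNING"):
            raise RuntimeError(f"unexpected job state: {payload}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish in time")


def summarise(payload):
    """Map each submitted ID to a list of (current accession, entry name)."""
    rows = {}
    for hit in payload.get("results", []):
        target = hit["to"]
        if isinstance(target, dict):
            pair = (target.get("primaryAccession"), target.get("uniProtkbId"))
        else:
            # Some targets come back as a plain identifier string.
            pair = (target, None)
        rows.setdefault(hit["from"], []).append(pair)
    for failed in payload.get("failedIds", []):
        # IDs the service could not map keep an empty list.
        rows.setdefault(failed, [])
    return rows
```

Usage would be along the lines of `summarise(wait_for_first_page(submit_job(["P31944"])))`. An ID that ends up with an empty list is one the service could not map, and those are the ones worth raising with the helpdesk.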


Hi MB - just to add to the other replies here, you could also look into the Ensembl ID History Converter, which takes a list of Ensembl IDs from a previous Ensembl release and reports which IDs they map to in the current release: https://www.ensembl.org/Homo_sapiens/Tools/IDMapper
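
If you would rather check IDs from a script, the Ensembl REST API has an `archive/id` endpoint that reports the latest version of a stable ID and whether it is still current. It is a related service, not the web converter itself, so treat this as a sketch. It assumes the server is `https://rest.ensembl.org` and that the JSON reply has `latest`, `is_current`, `release` and `possible_replacement` fields; the shape of the `possible_replacement` entries in particular is an assumption, so the code accepts both dictionaries and plain strings. The function names are invented for this example.

```python
import json
import urllib.request

# Assumption: public Ensembl REST server (current assembly).
REST = "https://rest.ensembl.org"


def fetch_archive_record(stable_id):
    """Fetch the archive record for one stable ID (needs network access)."""
    url = f"{REST}/archive/id/{stable_id}?content-type=application/json"
    with urllib.request.urlopen(url, timeout=60) as resp:
        return json.load(resp)


def describe(record):
    """Reduce an archive record to the fields needed for an ID table."""
    replacements = []
    for item in record.get("possible_replacement") or []:
        # Assumption: entries are dicts with a "stable_id" key, or bare strings.
        replacements.append(item.get("stable_id") if isinstance(item, dict) else item)
    return {
        "id": record.get("id"),
        "latest": record.get("latest"),
        # The flag may arrive as "1" or 1, so compare as text.
        "is_current": str(record.get("is_current", "")) == "1",
        "release": record.get("release"),
        "replacements": replacements,
    }
```

Running `describe(fetch_archive_record(ensp))` over the GRCh37 list would flag the IDs that are no longer current, and those are the ones that need a closer look.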


Thank you for your response. When I use the converter on the old ID 'ENSP00000221740', it doesn't return a new Ensembl protein ID, which made me think the ID had not changed. But when I search for that ID on the current site, nothing comes up. I then took the gene ID from the archived website: http://apr2022.archive.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000105141;r=19:15049480-15058293

When I search for the gene, I find that the new protein ID is actually ENSP00000393417: http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000105141;r=19:15049480-15058293

So my question is, why didn't the converter give me the new ENSP ID? Am I using it wrong? Here is a link for the results I got: https://uswest.ensembl.org/Homo_sapiens/Tools/IDMapper/Results?tl=OX3zo8qpS7d0HHvs-8486700


Hi MB,

This is because ENST00000221740/ENSP00000221740 does not map to any features in the current gene set:

[1] https://www.ensembl.org/Homo_sapiens/Transcript/Idhistory?t=ENST00000221740

You can see the differences between the transcripts annotated for CASP14 in the GRCh37 and GRCh38 assemblies on the following pages:

[2] https://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000105141;r=19:15049480-15058293

[3] https://grch37.ensembl.org/Homo_sapiens/Transcript/ProteinSummary?db=core;g=ENSG00000105141;r=19:15163015-15166900;t=ENST00000221740
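
When a protein ID has been retired like this, one fallback is to go through the gene, as was done by hand earlier in the thread: look up the gene's stable ID on the current site and list the translations it has now. Here is a hedged sketch using the Ensembl REST `lookup/id` endpoint with `expand=1`. It assumes the server is `https://rest.ensembl.org` and that the reply nests transcripts under `Transcript`, with an optional `Translation` object on each; the function names are made up for this example.

```python
import json
import urllib.request

# Assumption: public Ensembl REST server (current assembly).
REST = "https://rest.ensembl.org"


def fetch_gene(gene_id):
    """Fetch a gene with its transcripts expanded (needs network access)."""
    url = f"{REST}/lookup/id/{gene_id}?expand=1;content-type=application/json"
    with urllib.request.urlopen(url, timeout=60) as resp:
        return json.load(resp)


def current_translations(gene_record):
    """Return (transcript ID, protein ID) pairs for coding transcripts."""
    pairs = []
    for transcript in gene_record.get("Transcript", []):
        translation = transcript.get("Translation")
        # Non-coding transcripts have no Translation entry and are skipped.
        if translation:
            pairs.append((transcript["id"], translation["id"]))
    return pairs
```

For the example in this thread, `current_translations(fetch_gene("ENSG00000105141"))` would list the protein IDs currently annotated for CASP14. If a gene has several translations, picking the right one still needs a manual check, for example by comparing sequences.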
