I am trying to use Ensembl transcript ids to identify the amino acid sequences they encode.
My problem is that I have found the example below, in which the transcript IDs associated with a given amino acid sequence appear 'swapped' when going from a version of Ensembl that uses genome build 37 to one that uses genome build 38. Note that the 'Name' attribute has not swapped between the AA sequences. When asserting that the Ensembl transcript IDs have been swapped, I have ignored the version numbers after the decimal point in the ID.
My questions are:
How prevalent is this sort of swap?
Should I instead be using the 'Name' attribute of an Ensembl transcript if I want an ID that is stable with respect to the amino acid sequence?
Have I missed something obvious? I am just getting started with these data sources.
Many thanks,
Matt
Ensembl v86, using genome build 37
Name        Transcript Id      Bp    Protein
--------------------------------------------
RTFDC1-001  ENST00000023939.4  1650  306aa
RTFDC1-201  ENST00000357348.5  1476  336aa
Ensembl version 88, using genome build 38
Name        Transcript Id      Bp    Protein
--------------------------------------------
RTFDC1-001  ENST00000357348.9  2212  306aa
RTFDC1-201  ENST00000023939.8  1745  336aa
I had a similar problem with ENST instability even in the same genome build some time ago.
I summarised the answer from the Ensembl helpdesk here in my own question:
Stability of Ensembl and refseq stable IDs
At first I thought this was the same situation. It is not what you would expect from a stable ID, but it can happen.
But I just noticed Ensembl recently changed their stable ID documentation:
As far as I understand the "Mapping stable identifiers" part, the protein sequence or protein length is not taken into account when transcript IDs are assigned.
"… The identity of a transcript is thus defined by the list of its exon coordinates and its underlying sequence." Not the protein sequence! So, if I understand their documentation correctly, in most cases the protein will stay the same, but it is not part of the similarity comparison in the mapping, apart from the additional penalty described there for a total change of transcript function. Protein sequence similarity is therefore not the central idea of an ENST ID and is not guaranteed by the mapping between versions.
So my answers to your questions in detail:
How prevalent is this sort of swap?
I don’t know. The "important" transcripts I work with are stable protein-wise. I always wanted to find out for the general case of protein sequence change, but never found the time.
Should I instead be using the 'Name' attribute of an Ensembl transcript, if I want an ID that is stable with respect to the amino acid sequence?
No, it does not guarantee stability either. Have a look at SYNGAP1, for example, and compare it between versions 88 and 75 of Ensembl.
If you want to be sure nothing has changed, use the ENST combined with the version of Ensembl you took your data from to get the same sequence again.
If you have to compare transcripts across different Ensembl versions, it may help to keep track of the CCDS IDs or versioned ENSP IDs and find the transcript that maps to them in each new Ensembl version.
If you want to dig into it, have a look at this tool: http://ugahash.uni-frankfurt.de/ It might help to spot differences in transcripts.
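The advice above (pair each ENST with the Ensembl release it came from, and check the protein when comparing releases) can be sketched in Python as follows. This is only a minimal illustration, not an Ensembl tool: the records are copied from the two tables in the question, the function names `protein_length_by_id` and `changed_ids` are made up for this example, and protein length is used only as a rough stand-in for the full protein sequence.

```python
# Records copied from the two tables in the question:
# Ensembl release -> list of (name, versioned transcript ID, protein length in aa)
RELEASES = {
    86: [("RTFDC1-001", "ENST00000023939.4", 306),
         ("RTFDC1-201", "ENST00000357348.5", 336)],
    88: [("RTFDC1-001", "ENST00000357348.9", 306),
         ("RTFDC1-201", "ENST00000023939.8", 336)],
}


def protein_length_by_id(records):
    """Map unversioned transcript ID -> protein length for one release."""
    return {tid.split(".")[0]: length for _name, tid, length in records}


def changed_ids(old_records, new_records):
    """Return unversioned IDs found in both releases whose protein length differs.

    Hypothetical helper: length is only a proxy, so two different proteins
    of equal length would not be flagged.
    """
    old = protein_length_by_id(old_records)
    new = protein_length_by_id(new_records)
    return sorted(t for t in old.keys() & new.keys() if old[t] != new[t])


print(changed_ids(RELEASES[86], RELEASES[88]))
# → ['ENST00000023939', 'ENST00000357348']
```

Both unversioned IDs are flagged, which matches the swap described in the question. For real data, comparing the protein sequences themselves (or their checksums) would be a stricter test than comparing lengths.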
Have I missed something obvious? I am just getting started with these data sources.
I don’t think so. If you are just starting, you spotted this kind of problem faster than most, I guess ;)
Unversioned ENSTs are widely used in places where they don’t make any sense in the long run.
Our stable identifier mapping strategy aims to conserve stable IDs as much as possible.
This involves finding the best match between two gene sets while allowing some mismatches. Any difference in sequence or exon coordinates will result in a version increment.
In this example, the two transcripts have had large UTR sequences added on both 3' and 5' ends (over a third of the overall sequence for ENST00000023939.4) which makes it harder to map to the correct identifier.
We work on reducing these edge cases to a minimum, but their occurrences do increase when moving between assemblies, as the models are more likely to change when the underlying assembly has changed.
To ensure you are referring to the same sequence and structure, please use the versioned stable identifiers, for example ENST00000023939.4 in your example.
We advise against using the Name for reliably identifying a transcript, as names are not guaranteed to be stable.
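As a rough illustration of why the version suffix matters, the Python sketch below splits an identifier into accession and version and treats two identifiers as the same model only when both parts match. The names `StableId`, `parse_stable_id` and `same_model` are hypothetical and not part of any Ensembl API; the comparison relies on the statement above that any change in sequence or exon coordinates increments the version.

```python
from typing import NamedTuple, Optional


class StableId(NamedTuple):
    accession: str          # e.g. "ENST00000023939"
    version: Optional[int]  # e.g. 4, or None if the ID is unversioned


def parse_stable_id(text: str) -> StableId:
    """Split 'ENST00000023939.4' into accession and integer version."""
    accession, dot, version = text.partition(".")
    return StableId(accession, int(version) if dot else None)


def same_model(a: str, b: str) -> bool:
    """True only if both IDs carry a version and match in accession and version.

    Unversioned IDs are never treated as equal, because the accession
    alone does not pin down a sequence.
    """
    x, y = parse_stable_id(a), parse_stable_id(b)
    return x.version is not None and x == y


print(same_model("ENST00000023939.4", "ENST00000023939.8"))  # → False
print(same_model("ENST00000023939.4", "ENST00000023939.4"))  # → True
print(same_model("ENST00000023939", "ENST00000023939"))      # → False
```

Rejecting unversioned IDs outright is a deliberately strict choice for this sketch; a looser check would hide exactly the kind of change discussed in this thread.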