Question about UniProt Isoforms

0

10.4 years ago

pwg46 ▴ 540

Say I go to UniProt's website and look up O00142. The Ensembl cross-references section shows all of the ENSTs that this protein maps to, as well as the ENSTs that map to the protein's isoforms (O00142-2, ...). The list looks something like this:

ENSTX -> O00142

ENSTY -> O00142-1

ENSTZ -> O00142-2

Is there a difference between O00142 and O00142-1? I thought O00142-1 was the canonical (non-isoform) protein, so what is O00142 there for? This causes real problems for me. For example, if you take ENSTX's sequence (from the Ensembl database) and translate its codons into amino acids, the resulting sequence will NOT match O00142's amino acid sequence, whereas ENSTY's translation does match O00142-1. This has been the case every time both a UniProt accession and its -1 version are listed. Should I just ignore the UniProt accession without the isoform suffix?
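The comparison described above can be sketched in a few lines of Python. This is only a minimal sketch: it assumes the input is the transcript's coding sequence (CDS, in frame from the start codon) rather than the full cDNA with UTRs, and it uses the standard genetic code. The names `translate` and `matches_protein` are illustrative and not part of any UniProt or Ensembl tool.

```python
# Standard genetic code, built from the conventional compact T/C/A/G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds: str) -> str:
    """Translate a CDS into a protein sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for pos in range(0, len(cds) - len(cds) % 3, 3):
        residue = CODON_TABLE.get(cds[pos:pos + 3], "X")  # X for ambiguous codons
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


def matches_protein(cds: str, protein: str) -> bool:
    """Return True if the translated CDS is identical to the given protein sequence."""
    return translate(cds) == protein.strip().upper()
```

If the translation of a transcript does not match a UniProt sequence, it is worth checking first that the CDS (not the cDNA) was used and that the UniProt sequence is the one for the specific isoform the transcript is linked to.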

uniprot isoform ensembl enst • 3.5k views

ADD COMMENT • link updated 3.0 years ago by Ram 44k • written 10.4 years ago by pwg46 ▴ 540

Ram · Answer 1 · 2014-07-18

1

10.4 years ago

pld 5.1k

If you look under 'alt products' for the "-1" accession, it says:

This isoform has been chosen as the 'canonical' sequence. All positional information in this entry refers to it. This is also the sequence that appears in the downloadable versions of the entry.

I believe that the accession O00142 is the "Primary (citable) accession number: O00142", that is, the entry for the protein in general, while the "-1" accession represents a specific isoform of that protein.

I'm not seeing where you get an Ensembl transcript for the primary accession number; only links to the Ensembl transcripts for the isoforms are listed under the cross-references section.

ADD COMMENT • link 10.4 years ago by pld 5.1k

1

Note that the canonical sequence used in the UniProtKB entry is not always the '-1' isoform, and may change. So if you need to distinguish between the described isoforms, always use the isoform identifier; do not assume that O00142 == O00142-1.
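One way to follow this advice in code is to keep the isoform suffix explicit instead of silently dropping it. The sketch below assumes accessions of the form `O00142` or `O00142-2`; the helper name `split_accession` is hypothetical.

```python
from typing import Optional, Tuple


def split_accession(accession: str) -> Tuple[str, Optional[int]]:
    """Split a UniProt accession into (primary accession, isoform number or None)."""
    base, sep, isoform = accession.strip().partition("-")
    # No suffix means the entry as a whole, not a particular isoform.
    return base, int(isoform) if sep else None


def same_isoform(a: str, b: str) -> bool:
    """Compare two identifiers strictly: O00142 and O00142-1 are NOT treated as equal."""
    return split_accession(a) == split_accession(b)
```

With this strict comparison, `O00142` only ever matches `O00142`, so any code that needs a specific sequence has to name the isoform.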

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.3 years ago by hpmcwill ★ 1.2k

0

Joe is right: O00142 is the general accession number for this protein and doesn't refer to one specific isoform. O00142-1, -2, -3, -4 (and -5) refer to the specific isoforms.

I also don't understand where you see an Ensembl transcript for O00142.

The only thing I see is:

ENST00000417693; ENSP00000407469; ENSG00000166548. [O00142-4]
ENST00000451102; ENSP00000414334; ENSG00000166548. [O00142-1]
ENST00000525974; ENSP00000434594; ENSG00000166548.
ENST00000527284; ENSP00000435312; ENSG00000166548. [O00142-2]
ENST00000545043; ENSP00000438143; ENSG00000166548. [O00142-3]
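Lines like the ones above can be turned into a transcript-to-isoform lookup. This is a rough sketch that assumes each line has exactly the layout shown (transcript; protein; gene, then an optional isoform in square brackets); `parse_xrefs` and `EXAMPLE` are illustrative names.

```python
import re
from typing import Dict, Optional

# Matches "ENST...; ENSP...; ENSG.... [ISOFORM]" where the bracketed part is optional.
XREF_RE = re.compile(
    r"^(ENST\d+);\s*(ENSP\d+);\s*(ENSG\d+)\.(?:\s*\[([^\]]+)\])?\s*$"
)

EXAMPLE = """\
ENST00000417693; ENSP00000407469; ENSG00000166548. [O00142-4]
ENST00000451102; ENSP00000414334; ENSG00000166548. [O00142-1]
ENST00000525974; ENSP00000434594; ENSG00000166548.
ENST00000527284; ENSP00000435312; ENSG00000166548. [O00142-2]
ENST00000545043; ENSP00000438143; ENSG00000166548. [O00142-3]
"""


def parse_xrefs(text: str) -> Dict[str, Optional[str]]:
    """Map each Ensembl transcript ID to its UniProt isoform ID (None if untagged)."""
    mapping: Dict[str, Optional[str]] = {}
    for line in text.splitlines():
        match = XREF_RE.match(line.strip())
        if match:  # silently skip lines that do not fit the assumed layout
            mapping[match.group(1)] = match.group(4)
    return mapping
```

In this listing, ENST00000525974 carries no isoform tag, so it maps to `None` rather than to any particular isoform.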

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 10.4 years ago by Bert Overduin ★ 3.7k

Ram · Answer 2 · 2014-07-19

An important observation, for which the answer (but not a fix) lies in the conceptual differences between UniProt and Ensembl. As JC alludes to, in FASTA terms the primary accession is now a synonym for the -1 "variant". The problems arise from the challenges of syncing the two resources, which are also quasi-independent (except that the Ensembl gene-building heuristics obviously use UniProt). They may thus agree on the max-exon version as -1 but not on the others, certainly where Swiss-Prot used to list variants (-2 etc.) in order of curation. Note also that Ensembl may well "build" a spliced ORF that neither Swiss-Prot nor RefSeq nor CCDS will annotate without solid cDNAs, in this case 12 more