Question

Is the lowest-numbered Ensembl transcript ID always the "canonical" transcript?


3.1 years ago

Charles Murtaugh ▴ 50

I would like to identify a "canonical" transcript for every protein-coding gene in Ensembl. For project-related reasons, I'm using the EnsDb.Hsapiens.v75 package in R. I realize, of course, that "canonical" is a working definition at best, and inappropriate in some cases - but for ease of graphing some data I just want one transcript per gene for now. From manually inspecting genes in Ensembl, it looks like the lowest-numbered transcript ID for each corresponds to what I'm looking for. Some code to pull out a few examples:

library(EnsDb.Hsapiens.v75)
library(tidyverse)

genes <- keys(EnsDb.Hsapiens.v75, keytype='GENEID')
ensembl <- AnnotationDbi::select(EnsDb.Hsapiens.v75, keys=genes, keytype='GENEID',
                                 columns=c('GENEID', 'SYMBOL', 'GENEBIOTYPE'))
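# Keep protein-coding genes only (dplyr::filter is spelled out to avoid clashes with other filter() functions)
ensembl_cds <- dplyr::filter(ensembl, GENEBIOTYPE=='protein_coding')
# Fetch transcripts for the protein-coding genes only, not for every gene
ensembl_cds_tx <- AnnotationDbi::select(EnsDb.Hsapiens.v75, keys=ensembl_cds$GENEID, keytype='GENEID',
                                        columns=c('SYMBOL', 'TXID'))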
head(ensembl_cds_tx)

gois <- c('RSPO1', 'PRSS1', 'CDH1')
gois_tx <- filter(ensembl_cds_tx, SYMBOL %in% gois) %>% arrange(SYMBOL, TXID) %>% print()
gois_tx_lowest <- gois_tx[!duplicated(gois_tx$SYMBOL),] %>% print()

Each of the lowest transcript IDs pulled out above (ENST00000261769, ENST00000311737, ENST00000356545) corresponds to an Ensembl transcript for the respective gene (CDH1, PRSS1, RSPO1) that matches RefSeq and the Consensus CDS database. (For RSPO1, though, three other transcripts also have RefSeq matches, which speaks to the arbitrariness of picking a single canonical transcript.)

My question is, is this the general practice across the Ensembl transcript database, that the lowest numbered transcript for a gene corresponds to a canonical or semi-canonical transcript, or have I just gotten lucky so far?

transcript ensembl bioconductor r • 1.3k views

updated 3.1 years ago by Emily 24k • written 3.1 years ago by Charles Murtaugh ▴ 50


Emily_Ensembl can clarify but I doubt that is the case.

We had talked about using data from MANE in one of your other threads. MANE probably represents the most current understanding of human transcripts (since that is an active project). If you are not finding genes in that set then they may have been reassigned/renamed/changed in some way.

3.1 years ago by GenoMax 147k


I'm pretty sure it's not the case. In fact, if we define canonical as the transcript in the RefSeq or CCDS releases of the same date, then I think there are quite a lot of cases in Ensembl v75 where no Ensembl transcript is a perfect match. I think that in later releases of all three databases, a lot of work has been done to make them more comparable.

3.1 years ago by i.sudbery 20k

score 4 · Accepted Answer · 2021-11-10

No. The numbers are arbitrary. The canonical transcript is the one which is labelled canonical, which you can get as a filter or an attribute.
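As a rough illustration of selecting by label instead of by ID, here is a minimal base R sketch on a toy table. Everything in it is an assumption made for the example: the gene and transcript IDs are invented, and `is_canonical` is a hypothetical column name standing in for whatever canonical flag your annotation source exports.

```r
# Toy annotation table; all IDs and flags are invented for illustration.
tx <- data.frame(
  gene_id      = c("GENE_A", "GENE_A", "GENE_B", "GENE_B", "GENE_B"),
  tx_id        = c("ENST00000000101", "ENST00000000205",
                   "ENST00000000150", "ENST00000000120", "ENST00000000310"),
  is_canonical = c(TRUE, FALSE, TRUE, FALSE, FALSE),  # hypothetical flag column
  stringsAsFactors = FALSE
)

# Select by the flag: the result does not depend on how the IDs are numbered.
canonical <- tx[tx$is_canonical, c("gene_id", "tx_id")]

# Sanity checks: no gene appears twice, and every gene has a canonical transcript.
stopifnot(!anyDuplicated(canonical$gene_id),
          setequal(canonical$gene_id, tx$gene_id))

print(canonical)
```

In this toy table the flagged transcript for GENE_B is not the lowest-numbered one, which is the situation the ID-sorting shortcut would get wrong.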

The stable IDs are assigned in order, so the first transcript ever identified was ENST00000000001, the second ENST00000000002, etc. This means that for a gene, the one with the lowest number was the first one to be identified. In all probability, the first one identified is the one that is the most highly expressed, highly conserved and well-studied, which makes it coincidentally also the canonical. But it's not always the case.
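If you want to see how often the coincidence holds in your own data, one way is to compare the lowest-numbered transcript per gene against the canonical label. The base R sketch below does this on a toy table; the IDs are invented and `is_canonical` is a hypothetical column name, so treat it as a pattern to adapt, not as real Ensembl output.

```r
# Toy annotation table; all IDs and flags are invented for illustration.
tx <- data.frame(
  gene_id      = c("GENE_A", "GENE_A", "GENE_B", "GENE_B", "GENE_B"),
  tx_id        = c("ENST00000000101", "ENST00000000205",
                   "ENST00000000150", "ENST00000000120", "ENST00000000310"),
  is_canonical = c(TRUE, FALSE, TRUE, FALSE, FALSE),  # hypothetical flag column
  stringsAsFactors = FALSE
)

# Numeric part of the stable ID (drop the "ENST" prefix).
tx$tx_num <- as.numeric(sub("^ENST", "", tx$tx_id))

# Lowest-numbered transcript per gene: sort, then keep the first row of each gene.
ord    <- tx[order(tx$gene_id, tx$tx_num), ]
lowest <- ord[!duplicated(ord$gene_id), ]

# Does the lowest-numbered transcript carry the canonical flag?
agree <- setNames(lowest$is_canonical, lowest$gene_id)
print(agree)
cat(sprintf("Lowest ID is canonical for %d of %d genes\n",
            sum(agree), length(agree)))
```

Because the stable IDs are zero-padded to a fixed width, sorting them as plain strings (as in the question's `arrange(SYMBOL, TXID)`) gives the same order as sorting on the numeric part.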