Question

How to tell which transcript is the canonical transcript?

6

8.2 years ago

steve ★ 3.5k

For example, if I have a list of variants like this:

Gene_ID  Transcript    Coding   Amino_Acid_Change
TP53     NM_000546     c.G830T  p.C277F
TP53     NM_001126112  c.G830T  p.C277F
TP53     NM_001126113  c.G830T  p.C277F
TP53     NM_001126114  c.G830T  p.C277F
TP53     NM_001126115  c.G434T  p.C145F
TP53     NM_001126116  c.G434T  p.C145F
TP53     NM_001126117  c.G434T  p.C145F
TP53     NM_001126118  c.G713T  p.C238F

How could you figure out which of the transcripts is the canonical transcript?

Transcripts are supposedly listed in resources such as UCSC, RefSeq, and Ensembl, but I have gone through each of these and have not found anything that resembles the information listed above (ANNOVAR RefGene annotation output). The closest I've come is the UCSC Table Browser returning 'knownCanonical' for UCSC genes, but that comes as BED-style output with identifiers that do not resemble the ones in my data. ANNOVAR's own documentation says that it does not support any differential reporting for canonical transcripts.

annovar • 13k views

score 4 · Answer 1 · 2016-08-30

4

8.2 years ago

microfuge ★ 1.9k

I presume canonical information can be downloaded from here: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/knownCanonical.txt.gz (I have not used it myself). For the species I work with, which lack such information, I usually find the longest protein for each gene and treat it as canonical.
I don't know why BioMart does not provide this kind of important information. Also, sorry, I accidentally added this as an answer when I meant it as a comment.
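
For anyone who wants to try the longest-protein heuristic, here is a minimal sketch. It assumes you have already extracted (gene, transcript, protein length) tuples from your own annotation or protein FASTA; the function name, the tuple layout, and the IDs in the example are all hypothetical, and the tie-breaking rule is an arbitrary choice of mine.

```python
def longest_protein_per_gene(records):
    """Pick one transcript per gene: the one with the longest protein.

    `records` is an iterable of (gene_id, transcript_id, protein_length)
    tuples. This layout is an assumption for illustration, not a format
    produced by any particular tool.
    """
    best = {}
    for gene, transcript, length in records:
        # Sort key: longer protein first; ties go to the alphabetically
        # smaller transcript ID so the result does not depend on input order.
        key = (-length, transcript)
        if gene not in best or key < best[gene]:
            best[gene] = key
    return {gene: key[1] for gene, key in best.items()}


# Made-up example data (not real identifiers or lengths).
example = [
    ("geneA", "tx1", 300),
    ("geneA", "tx2", 393),
    ("geneB", "tx3", 120),
]
print(longest_protein_per_gene(example))
```

Keep in mind this is only a heuristic: the longest protein is not guaranteed to agree with what UCSC, RefSeq, or Ensembl would call canonical.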

0

Yes, that is the same information I got from UCSC previously; however, its records are in this format:

chr19 58310448 58326933 49943 uc284pmy.1 ENSG00000283103.1

I am not sure how to reconcile this with the format I have from the ANNOVAR RefGene output.

ADD REPLY • link 8.2 years ago by steve ★ 3.5k

1

This thread uses the Table Browser to get the related IDs for all canonical transcripts: http://redmine.soe.ucsc.edu/forum/index.php?t=tree&th=7602&mid=19939&S=f6391396b0b0a7bcd539e058e8edc96b&rev=&reveal=

ADD REPLY • link 8.2 years ago by microfuge ★ 1.9k

0

Looks like that page no longer exists...

ADD REPLY • link 5.3 years ago by steve ★ 3.5k

0

However, it looks like the script I wrote to implement the cross-referencing described there is still listed here. The gist of it is to match the IDs in knownCanonical.txt against the ones in kgXref.txt. I believe there is also a saved copy of the output in the GitHub repo at that script's location.
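
This is not the original script, just a rough sketch of the same matching idea. It assumes both files are tab-separated, that the transcript ID is the fifth column of knownCanonical.txt (as in the record quoted earlier in this thread), and that kgXref.txt has the UCSC ID, gene symbol, and RefSeq accession in columns 1, 5, and 6. Those kgXref positions are my assumption, so check them against the table schema for your assembly. The function names and the IDs in the example are made up.

```python
def load_canonical_ids(known_canonical_lines, transcript_col=4):
    """Collect the transcript IDs listed in knownCanonical.txt."""
    ids = set()
    for line in known_canonical_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > transcript_col:
            ids.add(fields[transcript_col])
    return ids


def canonical_refseq(known_canonical_lines, kg_xref_lines,
                     kg_id_col=0, symbol_col=4, refseq_col=5):
    """Map gene symbol -> set of RefSeq IDs whose UCSC ID is canonical.

    The default column positions are assumptions about the kgXref.txt
    layout; adjust them if your copy of the table differs.
    """
    canonical = load_canonical_ids(known_canonical_lines)
    needed = max(kg_id_col, symbol_col, refseq_col)
    result = {}
    for line in kg_xref_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= needed:
            continue  # skip short or malformed rows
        # Keep only rows for canonical transcripts that have a RefSeq ID.
        if fields[kg_id_col] in canonical and fields[refseq_col]:
            result.setdefault(fields[symbol_col], set()).add(fields[refseq_col])
    return result


# Made-up rows in the assumed layouts (placeholder IDs, not real records).
known_canonical = ["chr1\t100\t200\t1\tuc000aaa.1\tprotA\n"]
kg_xref = [
    "uc000aaa.1\tmrnaA\t\t\tGENE1\tNM_FAKE0001\t\tdesc\n",
    "uc000bbb.1\tmrnaB\t\t\tGENE1\tNM_FAKE0002\t\tdesc\n",
]
print(canonical_refseq(known_canonical, kg_xref))
```

With the real downloads, the two line iterables can come from `gzip.open(path, "rt")`. The resulting RefSeq IDs can then be compared against the Transcript column of the ANNOVAR output, keeping in mind that version suffixes may or may not be present on either side.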

ADD REPLY • link 5.3 years ago by steve ★ 3.5k

score 4 · Answer 2 · 2016-08-30

4

8.2 years ago

igor 13k

Those are RefSeq IDs, so you are looking for RefSeq info. There is a whole discussion about it here: https://groups.google.com/a/soe.ucsc.edu/forum/#!topic/genome/_6asF5KciPc

If you want to know what is really the "canonical" transcript, that's a whole different story. Canonical is not always canonical.

Update (8/2019): There have been some additional, more helpful discussions:

Why the list of genes in UCSC "knownGene" table is strikingly different than the list of genes in UCSC "known canonical" table?
How to get known canonical transcript information from UCSC for a specific gencode version
How does VEP decide on canonical transcripts and is there a list?

ADD COMMENT • link 5.3 years ago by igor 13k

2

I'm not really sure that there is even such a thing as "canonical" in real life. This is presumably why ANNOVAR doesn't support such a distinction.

ADD REPLY • link 8.2 years ago by i.sudbery 20k

2

Agreed. I've seen different sources disagree on what is canonical even for well-known genes.

ADD REPLY • link 8.2 years ago by igor 13k