How to tell which transcript is the canonical transcript?
8.2 years ago
steve ★ 3.5k

For example, if I have a list of variants like this:

Gene_ID  Transcript    Coding   Amino_Acid_Change
TP53     NM_000546     c.G830T  p.C277F
TP53     NM_001126112  c.G830T  p.C277F
TP53     NM_001126113  c.G830T  p.C277F
TP53     NM_001126114  c.G830T  p.C277F
TP53     NM_001126115  c.G434T  p.C145F
TP53     NM_001126116  c.G434T  p.C145F
TP53     NM_001126117  c.G434T  p.C145F
TP53     NM_001126118  c.G713T  p.C238F

How could you figure out which of the transcripts is the canonical transcript?

Supposedly, transcripts are listed in places like UCSC, RefSeq, and Ensembl. But I have gone through each of these and have not been able to find anything that resembles the information I've listed above (ANNOVAR RefGene annotation output). The closest I've come is the UCSC Table Browser returning 'knownCanonical' for UCSC genes, but this is in a BED-style output with identifiers that do not resemble my given data. ANNOVAR's own documentation says that it does not support any differential reporting for canonical transcripts.

annovar • 13k views
8.2 years ago
microfuge ★ 1.9k

I presume the canonical information can be downloaded from http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/knownCanonical.txt.gz, but I have not used it. For the species I work with that lack such information, I usually determine the longest protein and treat its transcript as canonical.
I don't know why BioMart does not provide that kind of important information. Also, sorry, I accidentally added this as an answer when I meant it as a comment.
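
For what it's worth, the longest-protein heuristic could be sketched roughly like this. It assumes you already have (gene, transcript, protein length) triples from your own annotation; the function name and the sample records are hypothetical, not taken from any real database.

```python
def longest_protein_transcripts(records):
    """Return {gene: transcript} picking the transcript with the longest protein.

    `records` is an iterable of (gene, transcript, protein_length) tuples.
    On a tie, the transcript seen first is kept.
    """
    best = {}
    for gene, transcript, protein_length in records:
        current = best.get(gene)
        if current is None or protein_length > current[1]:
            best[gene] = (transcript, protein_length)
    return {gene: transcript for gene, (transcript, _) in best.items()}


# Hypothetical records, for illustration only.
example_records = [
    ("GENE_A", "tx_A1", 310),
    ("GENE_A", "tx_A2", 393),
    ("GENE_B", "tx_B1", 120),
    ("GENE_A", "tx_A3", 393),  # ties with tx_A2, so tx_A2 is kept
]

chosen = longest_protein_transcripts(example_records)
print(chosen)
```

Note that this is only a fallback rule of thumb; the longest protein is not guaranteed to match what any given database labels as canonical.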


Yes, that is the same information I got from UCSC previously; however, its records are in this format:

chr19 58310448 58326933 49943 uc284pmy.1 ENSG00000283103.1

I am not sure how to reconcile this with the format of my ANNOVAR RefGene output.


This thread uses the UCSC Table Browser to get the related IDs for all canonical transcripts: http://redmine.soe.ucsc.edu/forum/index.php?t=tree&th=7602&mid=19939&S=f6391396b0b0a7bcd539e058e8edc96b&rev=&reveal=


Looks like that page no longer exists...


However, it looks like the script I wrote to implement the cross-referencing described there is still listed here. The gist of it is to match the IDs in knownCanonical.txt against the ones in kgXref.txt. I believe there is also a saved copy of the output in the GitHub repo at that script's location.
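
This is not the original script, just a minimal sketch of the same cross-referencing idea. It assumes both files are tab-separated, that the UCSC transcript ID is the fifth column of knownCanonical.txt (as in the record shown earlier in this thread), and that kgXref.txt has the UCSC ID in its first column and the RefSeq accession in its sixth; check those positions against the table schema for your assembly. The UCSC IDs and coordinates in the sample lines are made up.

```python
def strip_version(accession):
    """Drop a trailing '.N' version suffix, e.g. 'NM_000546.6' -> 'NM_000546'."""
    return accession.split(".", 1)[0]


def canonical_refseq_ids(known_canonical_lines, kg_xref_lines,
                         kg_id_col=0, refseq_col=5):
    """Collect RefSeq accessions linked to transcripts in knownCanonical."""
    # Column 5 (index 4) of knownCanonical holds the UCSC transcript ID.
    canonical_ucsc = set()
    for line in known_canonical_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 5:
            canonical_ucsc.add(fields[4])

    refseq = set()
    for line in kg_xref_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(kg_id_col, refseq_col):
            continue  # skip short or malformed rows
        if fields[kg_id_col] in canonical_ucsc and fields[refseq_col]:
            refseq.add(strip_version(fields[refseq_col]))
    return refseq


# Made-up sample lines standing in for the two downloaded files.
known_canonical = ["chr17\t100\t200\t1\tucTEST1.1\tprotTEST1\n"]
kg_xref = [
    "ucTEST1.1\tmrnaA\t\t\tTP53\tNM_000546\t\t\t\t\n",
    "ucTEST2.1\tmrnaB\t\t\tTP53\tNM_001126112\t\t\t\t\n",
]

# Two rows in the shape of the ANNOVAR RefGene output from the question.
annovar_rows = [
    ("TP53", "NM_000546", "c.G830T", "p.C277F"),
    ("TP53", "NM_001126112", "c.G830T", "p.C277F"),
]

canonical = canonical_refseq_ids(known_canonical, kg_xref)
kept = [row for row in annovar_rows if strip_version(row[1]) in canonical]
print(kept)
```

With the real files you would pass open file handles (e.g. from `gzip.open(path, "rt")`) instead of the sample lists, then filter the ANNOVAR rows against the resulting set in the same way.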

8.2 years ago
igor 13k

Those are RefSeq IDs, so you are looking for RefSeq info. There is a whole discussion about it here: https://groups.google.com/a/soe.ucsc.edu/forum/#!topic/genome/_6asF5KciPc

If you want to know what is really the "canonical" transcript, that's a whole different story. Canonical is not always canonical.

Update (8/2019): There have been some additional, more helpful discussions:


I'm not really sure that there is even such a thing as "canonical" in real life. This is presumably why ANNOVAR doesn't support such a distinction.


Agreed. I've seen different sources disagree on what is canonical even for well-known genes.
