For example, if I have a list of variants like this:
Gene_ID Transcript Coding Amino_Acid_Change
TP53 NM_000546 c.G830T p.C277F
TP53 NM_001126112 c.G830T p.C277F
TP53 NM_001126113 c.G830T p.C277F
TP53 NM_001126114 c.G830T p.C277F
TP53 NM_001126115 c.G434T p.C145F
TP53 NM_001126116 c.G434T p.C145F
TP53 NM_001126117 c.G434T p.C145F
TP53 NM_001126118 c.G713T p.C238F
How can I figure out which of these transcripts is the canonical transcript?
Supposedly, transcripts are listed in places like UCSC, RefSeq, and Ensembl. But I have gone through each of these and have not been able to find anything that resembles the information I've listed above (ANNOVAR RefGene annotation output). The closest I've come is the UCSC Table Browser returning 'knownCanonical' for UCSC genes, but this is in a BED-style output with identifiers that do not resemble my given data. ANNOVAR's own documentation says that it does not support any differential reporting for canonical transcripts.
Yes, that is the same information I got from UCSC previously; however, its records are in this format:
chr19 58310448 58326933 49943 uc284pmy.1 ENSG00000283103.1
I am not sure how to reconcile this with the format I have from the ANNOVAR RefGene output.
This thread uses the Table Browser to get related IDs for all canonical transcripts: http://redmine.soe.ucsc.edu/forum/index.php?t=tree&th=7602&mid=19939&S=f6391396b0b0a7bcd539e058e8edc96b&rev=&reveal=
Looks like that page no longer exists...
However, it looks like the script I wrote to implement the cross-referencing described there is still listed here. The gist of it is to match the IDs in knownCanonical.txt against the ones in kgXref.txt. I believe there is also a saved copy of the output in the GitHub repo at that script's location.
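For anyone who can't get to the script, here is a minimal Python sketch of that kind of cross-referencing. It is not the script itself, just an illustration under some assumptions: both files are tab-separated UCSC table dumps, knownCanonical.txt has the UCSC transcript ID in its 5th column (as in the record shown above), and kgXref.txt has the UCSC ID in its 1st column and the RefSeq accession in its 6th. Check the column order against the table schema for your genome build before relying on it. The function names are made up for this example.

```python
def canonical_refseq_ids(known_canonical_lines, kg_xref_lines):
    """Collect RefSeq accessions that kgXref links to canonical UCSC transcripts.

    Both arguments are iterables of tab-separated lines (e.g. open file handles).
    """
    # ASSUMED knownCanonical layout:
    # chrom, chromStart, chromEnd, clusterId, transcript, protein
    canonical = set()
    for line in known_canonical_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 4:
            canonical.add(fields[4])

    # ASSUMED kgXref layout:
    # kgID, mRNA, spID, spDisplayID, geneSymbol, refseq, ...
    refseq_ids = set()
    for line in kg_xref_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 5 and fields[0] in canonical and fields[5]:
            # Drop any version suffix so "NM_x.3" compares equal to "NM_x"
            refseq_ids.add(fields[5].split(".")[0])
    return refseq_ids


def flag_canonical(annovar_rows, refseq_ids):
    """Pair each ANNOVAR row with True/False for 'transcript is canonical'.

    Each row is a list like [Gene_ID, Transcript, Coding, Amino_Acid_Change].
    """
    return [(row, row[1].split(".")[0] in refseq_ids) for row in annovar_rows]
```

With the downloaded tables you would pass open file handles, e.g. `canonical_refseq_ids(open("knownCanonical.txt"), open("kgXref.txt"))`, and then run `flag_canonical` over the split ANNOVAR rows. Note that a canonical UCSC transcript may have no RefSeq accession in kgXref, in which case none of a gene's NM_ transcripts will be flagged.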