Question

How to get known canonical transcript information from UCSC for a specific gencode version

5.9 years ago

komal.rathi ★ 4.1k

Hi,

I am using the UCSC Genome Browser to get known canonical transcripts using this link. The default track is Gencode V29, and for that version I can set the table to knownCanonical. However, when I change the track to ALL Gencode V23, the table options change and I can no longer access any table corresponding to knownCanonical.

Does anybody know how I can get the canonical transcript info for gencode v23?

Thanks!

gencode ucsc canonical transcripts • 4.9k views


You should ask this at the UCSC Genome Browser help desk. Someone from UCSC swings by here, but they may not do so right away.

ADD REPLY • link 5.9 years ago by GenoMax 150k

Thanks, I will do that. I will keep this open and post any responses I get from the help desk.

ADD REPLY • link 5.9 years ago by komal.rathi ★ 4.1k

score 5 · Accepted Answer · 2019-05-20

I got a response from the UCSC Genome Browser help desk which resolved my question:

The knownCanonical gene set is created from the longest transcript of the basic Gencode gene set. This convention was not around for the V23 gene set, so that file does not exist. If you would like to use a similar dataset without filtering for only the longest transcripts, you can use the Basic annotation set from Gencode V23 (http://genome.ucsc.edu/cgi-bin/hgTables?db=hg38&hgta_track=wgEncodeGencodeV23).

Alternately, you can filter out shorter transcripts, leaving only the longest isoform of each gene, by running a short script from the command line. Here "longest" means the largest genomic span (txEnd - txStart).

    # Columns returned: transcript ID, gene ID, genomic span.
    mysql -h genome-mysql.soe.ucsc.edu -u genome -Ne "select g.name, a.geneId, g.txEnd-g.txStart
      from wgEncodeGencodeBasicV23 g, wgEncodeGencodeAttrsV23 a
      where g.name = a.transcriptId" hg38 \
      | sort -rnk 3 \
      | awk '{if (!found[$2]) print; found[$2] = 1}' \
      | awk '{print $1}' > knownCanonicalV23.txt
  
The output of this script (knownCanonicalV23.txt) can be uploaded as identifier input in Table Browser. Using that file as Table Browser identifiers should allow output as if you were querying a knownCanonical data set from Gencode V23.
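
The sort and awk filtering in the first command can be tried offline before querying the server. The following is a minimal sketch on made-up rows: the transcript and gene names and the `pick_longest` function are hypothetical and not part of the help desk response. The three tab-separated columns stand in for the query output (transcript ID, gene ID, txEnd - txStart), and the function prints the transcript ID of the widest row for each gene.

```shell
# Hypothetical helper: reads "transcript<TAB>gene<TAB>span" rows on stdin
# and prints the transcript ID with the largest span for each gene.
pick_longest() {
  # Sort by the third column, largest first, then keep only the
  # first row seen for each gene ID (column 2).
  sort -t "$(printf '\t')" -k3,3nr \
    | awk -F '\t' '!seen[$2]++ { print $1 }'
}

# Made-up sample rows: geneA has two transcripts, geneB has one.
printf 'tx1\tgeneA\t1200\ntx2\tgeneA\t5400\ntx3\tgeneB\t800\n' | pick_longest
```

For geneA the sketch keeps tx2, the row with the larger span, and for geneB it keeps the only row, tx3.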

If you want to download a genePred file equivalent of knownCanonical for Gencode V23, you can run the following script on the command line.

    # Columns returned: genomic span, gene ID, then every column of the Basic table.
    mysql -h genome-mysql.soe.ucsc.edu -u genome -Ne "select g.txEnd-g.txStart, a.geneId, g.*
      from wgEncodeGencodeBasicV23 g, wgEncodeGencodeAttrsV23 a
      where g.name = a.transcriptId" hg38 \
      | sort -rn \
      | awk '{if (!found[$2]) print; found[$2] = 1}' \
      | cut -f 4- > knownCanonicalV23.gp