Hello,
We have an FAQ page that covers this topic (http://genome.ucsc.edu/FAQ/FAQgenes.html#singledownload). As posted by ATpoint, it boils down to different datasets and different approaches.
hg19 knownCanonical was last updated in 2013 and was built primarily from RefSeq and GenBank sequences, plus a few other sources. One isoform, typically the longest, was chosen for each gene (as defined by UCSC IDs).
For hg38, knownCanonical was last built on the GENCODE v36 models earlier this year. In this case one canonical isoform was chosen per ENSEMBL gene ID. The hierarchy used to choose that isoform is described in the FAQ page.
So while the tables have the same name, they originate from different data and different designations of a gene.
A more recent and 'standardized' approach is to use NCBI's RefSeq Select transcripts (https://www.ncbi.nlm.nih.gov/refseq/refseq_select/). These are NCBI's pick of a single representative transcript for every protein-coding gene. If you compare the transcript counts across hg19 and hg38, you'll see they are very similar:
#Assembly #tableName #count
hg19 ncbiRefSeqSelect 21461
hg38 ncbiRefSeqSelect 21763
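If you want to reproduce counts like these yourself, here is a minimal Python sketch that tallies rows and distinct gene symbols from a local, tab-separated dump of the table. It assumes the dump follows the genePredExt layout with a leading bin column (transcript name in the second field, gene symbol in the thirteenth); please check the table schema for your assembly before relying on those positions. The function names and file paths are only illustrative.

```python
import gzip
from typing import Iterable, Tuple

# Assumed 0-based column positions in a genePredExt dump with a leading "bin" column.
# Verify against the table schema before trusting the result.
NAME_COL = 1    # transcript accession, e.g. an NM_ identifier
NAME2_COL = 12  # gene symbol


def count_transcripts(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of transcript rows, number of distinct gene symbols)."""
    n_rows = 0
    genes = set()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment/header lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        n_rows += 1
        if len(fields) > NAME2_COL:
            genes.add(fields[NAME2_COL])
    return n_rows, len(genes)


def count_file(path: str) -> Tuple[int, int]:
    """Count rows and genes in a gzip-compressed table dump at a local path."""
    with gzip.open(path, "rt") as handle:
        return count_transcripts(handle)


# Example usage (hypothetical local file names):
#   print("hg19", count_file("hg19.ncbiRefSeqSelect.txt.gz"))
#   print("hg38", count_file("hg38.ncbiRefSeqSelect.txt.gz"))
```

Running the same function on the knownCanonical dumps would let you compare all four tables side by side, although the column positions differ for that table and would need adjusting.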
Sometime in the near future the MANE project (https://www.ncbi.nlm.nih.gov/refseq/refseq_select/#MANE) will release a list of canonical transcripts for hg38 that are standardized between RefSeq and GENCODE.
If you have any follow-up questions, our public help desk can always be reached at genome@soe.ucsc.edu. You may also send questions to genome-www@soe.ucsc.edu if they contain sensitive data. For any Genome Browser questions on Biostars, the UCSC tag is the best way to ensure visibility by the team.
A: which annotation I should for RNA-seq? Ensembl, UCSC or refseq?
You are not only comparing two different genome builds here but also two different consortia (RefSeq vs GENCODE). Ensembl uses GENCODE, see linked answer.
Also: https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-16-S8-S2
Or simply search for Gencode vs RefSeq on the web, there are many posts addressing this.
Yes, it's RefSeq vs. GENCODE. But it still amazes me that there's a 2x difference in canonical transcripts.
A canonical transcript is linked to a gene (maybe not strictly, but if we consider a cluster to be a gene, that's what it translates to). So does that mean that when we use hg19/RefSeq-based canonical transcripts, we are missing half of the expressed regions?