Hello Vijay,
What you are observing with the missing BBS5 entry in the knownCanonical table is an artifact of how that table was created for hg19. If you take a look at the hg19 UCSC Genes description page (http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=knownGene ), we define knownCanonical as follows:
knownCanonical identifies the canonical isoform of each cluster ID, or gene. Generally, this is the longest isoform.
The problem arises when two genes have overlapping coordinates and one lies entirely within the other: the algorithm treats them as isoforms of a single cluster, so the smaller gene is missed by knownCanonical. You can see this with BBS5 by going to the following session (http://genome.ucsc.edu/s/Lou/hg19_MLQ1 ). KLHL41 has a transcript with the same start site as BBS5, but it extends much further, so all of the BBS5 transcripts fall within it. If you query the Table Browser for these coordinates, you see only KLHL41.
Using coordinates chr2:170,331,250-170,374,046:
chr2 170366211 170382772 17243 uc002ueu.1 uc002ueu.1 KLHL41
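To make the containment problem concrete, here is a minimal sketch of overlap-based clustering followed by a "longest isoform wins" rule. It is not the actual UCSC pipeline, only an illustration of the behavior described above. The `Transcript` type, the function names, and all coordinates and gene labels (`GENE_A`, `GENE_B`, `GENE_C`) are made up for the example.

```python
from typing import NamedTuple


class Transcript(NamedTuple):
    name: str
    gene: str
    start: int
    end: int


def cluster_by_overlap(transcripts):
    """Group transcripts whose intervals overlap (one chromosome, one strand assumed)."""
    clusters = []
    current = []
    current_end = None
    for tx in sorted(transcripts, key=lambda t: (t.start, t.end)):
        if current and tx.start < current_end:
            # Overlaps the running cluster: treat it as another "isoform".
            current.append(tx)
            current_end = max(current_end, tx.end)
        else:
            if current:
                clusters.append(current)
            current = [tx]
            current_end = tx.end
    if current:
        clusters.append(current)
    return clusters


def longest(cluster):
    """Pick the transcript with the largest genomic span."""
    return max(cluster, key=lambda t: t.end - t.start)


# Toy data (hypothetical coordinates): GENE_A sits entirely inside a long
# GENE_B transcript that shares its start site, as with BBS5 and KLHL41.
transcripts = [
    Transcript("txA1", "GENE_A", 100, 400),
    Transcript("txA2", "GENE_A", 150, 380),
    Transcript("txB1", "GENE_B", 100, 900),
    Transcript("txC1", "GENE_C", 1200, 1500),
]

canonical = [longest(c) for c in cluster_by_overlap(transcripts)]
missing = sorted({t.gene for t in transcripts} - {t.gene for t in canonical})
print("canonical:", [t.name for t in canonical])
print("genes with no canonical transcript:", missing)
```

In this toy case the contained gene ends up with no canonical entry at all, which is the same pattern as BBS5 in hg19.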
To get around this, you can use the complete knownGene table, or you could use the knownCanonical table for hg38. For the hg38 assembly (http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&g=knownGene ) the table was generated differently:
knownCanonical identifies the canonical isoform of each cluster ID or gene using the ENSEMBL gene IDs to define each cluster. The canonical transcript is chosen using the APPRIS principal transcript when available. If no APPRIS tag exists for any transcript associated with the cluster, then a transcript in the BASIC set is chosen. If no BASIC transcript exists, then the longest isoform is used.
This new method does not have the same issue as hg19, because it clusters by ENSEMBL gene ID and then picks the transcript using APPRIS tags, then the GENCODE BASIC set, and finally, if neither is available, the longest isoform. If you convert the region from the session above to hg38 (View in the top blue bar -> In Other Genomes) you get the coordinates chr2:169,474,740-169,517,536. If you then query that position against the knownCanonical table in the Table Browser, you get the following results:
chr2 169479177 169506655 10932 ENST00000295240.7 ENSG00000163093.11 BBS5
chr2 169479479 169525922 41682 ENST00000513963.1 ENSG00000251569.1 AC093899.2
chr2 169509701 169526262 36897 ENST00000284669.1 ENSG00000239474.6 KLHL41
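The hg38 selection order quoted above (APPRIS principal, then BASIC, then longest) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Tx` record, its field names, and the toy IDs are hypothetical, and breaking ties within a tier by length is my own assumption, not something the description page states.

```python
from typing import NamedTuple


class Tx(NamedTuple):
    tx_id: str
    gene_id: str            # cluster key: one cluster per gene ID
    length: int
    appris_principal: bool  # assumed flag for an APPRIS principal tag
    basic: bool             # assumed flag for membership in the BASIC set


def pick_canonical(transcripts):
    """Return {gene_id: canonical Tx} using APPRIS > BASIC > longest."""
    by_gene = {}
    for tx in transcripts:
        by_gene.setdefault(tx.gene_id, []).append(tx)

    canonical = {}
    for gene_id, txs in by_gene.items():
        appris = [t for t in txs if t.appris_principal]
        basic = [t for t in txs if t.basic]
        # Use the first non-empty tier; fall back to all transcripts.
        pool = appris or basic or txs
        # Assumption: within a tier, take the longest transcript.
        canonical[gene_id] = max(pool, key=lambda t: t.length)
    return canonical


# Toy data with hypothetical IDs.
toy = [
    Tx("t1", "G1", 500, False, True),
    Tx("t2", "G1", 300, True, True),    # APPRIS principal wins despite being shorter
    Tx("t3", "G2", 200, False, True),   # no APPRIS tag: BASIC wins over longer non-BASIC
    Tx("t4", "G2", 900, False, False),
    Tx("t5", "G3", 100, False, False),  # neither tag: longest wins
    Tx("t6", "G3", 250, False, False),
]

canonical = pick_canonical(toy)
for gene_id, tx in sorted(canonical.items()):
    print(gene_id, tx.tx_id)
```

Because clusters are keyed by gene ID rather than by coordinate overlap, a gene nested inside another gene still gets its own canonical transcript, which is why BBS5 appears in the hg38 output.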
UPDATE#1
In fact, I noticed that there are 1325 genes that are present in the knownGene table but not in the knownCanonical table.
UPDATE#2
There is a striking difference between the row counts of the two tables.
Schema for knownGene Row Count: 82,960
Schema for knownCanonical Row Count: 31,848
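Row counts mix transcripts and genes, so a comparison of unique gene symbols is more informative. A minimal sketch of that comparison is below. It assumes both tables have been exported from the Table Browser as tab-separated text with a gene symbol column; the column name `geneSymbol`, the helper name, and the placeholder transcript names are assumptions for illustration, not the actual export layout.

```python
import csv
import io


def gene_symbols(tsv_text, column="geneSymbol"):
    """Collect the unique, non-empty gene symbols from tab-separated text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[column] for row in reader if row[column]}


# Toy stand-ins for the two exports (transcript names for BBS5 are placeholders).
KNOWN_GENE_TSV = "\n".join([
    "name\tgeneSymbol",
    "txBBS5_a\tBBS5",
    "txBBS5_b\tBBS5",
    "uc002ueu.1\tKLHL41",
])
KNOWN_CANONICAL_TSV = "\n".join([
    "transcript\tgeneSymbol",
    "uc002ueu.1\tKLHL41",
])

all_genes = gene_symbols(KNOWN_GENE_TSV)
canonical_genes = gene_symbols(KNOWN_CANONICAL_TSV)

# Genes that have transcripts in knownGene but no canonical entry.
missing = sorted(all_genes - canonical_genes)
print(len(all_genes), len(canonical_genes), missing)
```

With real exports, you would read the file contents into the two strings instead of using the toy text.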
knownCanonical generally keeps only the longest isoform of each gene, so it is not surprising that the number is smaller. See the definitions under the Related Data section on this page.
genomax, I agree about the definition of "canonical". The numbers shown in the Venn diagram are counts of unique gene entries in both sets, so the definition does not explain why the list of genes in "knownCanonical" is smaller (see Venn above). Every gene must have one canonical isoform.
Actually, I am trying to replicate the steps described in the post below:
How To Get Bed File Containing Exons Of Canonical Transcripts And Their Corresponding Gene Symbols
The problem is that there are several genes missing from the final BED file.