Question

UCSC knownCanonical hg19 vs. hg38

1

4.3 years ago

Sanjar ▴ 150

I searched the UCSC Genome Browser website but couldn't find relevant information.

The hg19 version of the canonical transcripts table (knownCanonical) has 32K entries, whereas the hg38 version (GENCODE-based) has 66K entries.

What accounts for this 2x difference?

genome UCSC hg19 hg38 canonical • 4.5k views

ADD COMMENT • link updated 3.8 years ago by Luis Nassar ▴ 670 • written 4.3 years ago by Sanjar ▴ 150

1

A: which annotation I should for RNA-seq? Ensembl, UCSC or refseq?

You are not only comparing two different genome builds here but also two different consortia (RefSeq vs GENCODE). Ensembl uses GENCODE, see linked answer.

Also: https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-16-S8-S2

Or simply search for Gencode vs RefSeq on the web, there are many posts addressing this.

ADD REPLY • link 4.3 years ago by ATpoint 88k

0

Yes, it's RefSeq vs. GENCODE. But it still amazes me that there's a 2x difference in canonical transcripts.

A canonical transcript is linked to a gene (maybe not strictly, but if we consider a cluster to be a gene, that's what it translates to). So does that mean that when we use hg19/RefSeq-based canonical transcripts, we are missing half of the expressed regions?

ADD REPLY • link 4.3 years ago by Sanjar ▴ 150

score 2 · Answer 1 · 2021-09-08

2

3.8 years ago

Luis Nassar ▴ 670

Hello,

We have an FAQ page that covers this topic (http://genome.ucsc.edu/FAQ/FAQgenes.html#singledownload). As posted by ATpoint, it boils down to different datasets and different approaches.

hg19 knownCanonical was built primarily from RefSeq and GenBank sequences, plus a few other sources, and was last updated in 2013. One isoform, typically the longest, was identified for each gene (as defined by UCSC IDs).

For hg38, knownCanonical was last built on the GENCODE v36 models earlier this year. In this case, one canonical isoform was chosen per Ensembl gene ID. The hierarchy used to choose it is described on the FAQ page.

So while the tables have the same name, they originate from different data and different designations of a gene.
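To see the difference for yourself, you can count rows in each downloaded table dump. The following is a minimal sketch, not an official UCSC tool. It assumes a tab-separated dump with the columns chrom, chromStart, chromEnd, clusterId, transcript, protein (check the accompanying .sql file for the actual layout), and the function names are made up for illustration.

```python
import gzip
from typing import Iterable, Tuple


def summarize_canonical(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (row count, distinct transcript IDs) for knownCanonical-style rows."""
    rows = 0
    transcripts = set()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Assumed column order:
        # chrom, chromStart, chromEnd, clusterId, transcript, protein
        transcripts.add(fields[4])
        rows += 1
    return rows, len(transcripts)


def summarize_file(path: str) -> Tuple[int, int]:
    """Summarize a gzipped dump, e.g. a locally saved knownCanonical.txt.gz."""
    with gzip.open(path, "rt") as handle:
        return summarize_canonical(handle)
```

Running `summarize_file` on the hg19 and hg38 dumps gives the two row counts to compare. The difference comes from how a gene is defined in each build, not from how the rows are counted.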

A more recent and 'standardized' approach is to use NCBI's RefSeq Select transcripts (https://www.ncbi.nlm.nih.gov/refseq/refseq_select/). These are NCBI's pick of a single representative transcript for every protein-coding gene. If you compare these numbers across hg19 and hg38, you'll see they are very similar:

#Assembly #tableName #count
hg19 ncbiRefSeqSelect 21461
hg38 ncbiRefSeqSelect 21763
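For scale, here is a quick calculation using only the figures quoted in this thread. The knownCanonical counts are the rounded 32K and 66K from the question, so treat that result as approximate.

```python
def relative_difference(old: int, new: int) -> float:
    """Fractional change going from old to new."""
    return (new - old) / old


# Counts quoted in this thread; the knownCanonical figures are rounded.
known_canonical = {"hg19": 32_000, "hg38": 66_000}
refseq_select = {"hg19": 21_461, "hg38": 21_763}

for name, counts in (
    ("knownCanonical", known_canonical),
    ("ncbiRefSeqSelect", refseq_select),
):
    change = relative_difference(counts["hg19"], counts["hg38"])
    print(f"{name}: {change:+.1%} from hg19 to hg38")
```

The RefSeq Select counts differ by about 1.4% between assemblies, whereas the knownCanonical counts roughly double.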

Sometime in the near future the MANE project (https://www.ncbi.nlm.nih.gov/refseq/refseq_select/#MANE) will release a list of canonical transcripts for hg38 that are standardized between RefSeq and GENCODE.

If you have any follow-up questions, our public help desk can always be reached at genome@soe.ucsc.edu. You may also send questions to genome-www@soe.ucsc.edu if they contain sensitive data. For any Genome Browser questions on Biostars, the UCSC tag is the best way to ensure visibility by the team.

ADD COMMENT • link 3.8 years ago by Luis Nassar ▴ 670

2

Sometime in the near future the MANE project (https://www.ncbi.nlm.nih.gov/refseq/refseq_select/#MANE) will release a list of canonical transcripts for hg38 that are standardized between RefSeq and GENCODE.

These data are available now. See https://www.ncbi.nlm.nih.gov/refseq/MANE/ for more details, and this section for data access.

ADD REPLY • link 3.8 years ago by vkkodali_ncbi ★ 3.8k

1

They are, and the latest version is 0.95, which covers almost all genes. We combined that with RefSeq Select (and removed duplicates) for our current hg38 track.
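The combination step can be pictured roughly as follows. This is only an illustrative sketch, not the actual UCSC pipeline. It assumes each set is a list of (gene symbol, transcript accession) pairs and that a duplicate means the same accession appearing in both sets. The function name is hypothetical and the records in the usage lines are placeholders, not real accessions.

```python
from typing import Iterable, List, Tuple

Record = Tuple[str, str]  # (gene symbol, transcript accession)


def combine_without_duplicates(
    mane: Iterable[Record], refseq_select: Iterable[Record]
) -> List[Record]:
    """Concatenate MANE and RefSeq Select records, keeping each accession once.

    MANE records come first, so they win when both sets list the same
    accession (an assumption made for this sketch).
    """
    seen = set()
    combined: List[Record] = []
    for gene, accession in list(mane) + list(refseq_select):
        if accession in seen:
            continue
        seen.add(accession)
        combined.append((gene, accession))
    return combined


# Placeholder records for illustration only.
mane_records = [("GENE_A", "NM_PLACEHOLDER_A.1")]
select_records = [
    ("GENE_A", "NM_PLACEHOLDER_A.1"),
    ("GENE_B", "NM_PLACEHOLDER_B.1"),
]
print(combine_without_duplicates(mane_records, select_records))
```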

ADD REPLY • link 3.8 years ago by Luis Nassar ▴ 670

0

Excellent! I was somewhat dissuaded by the description on the page, which says that only about half of the genes are covered, but the FTP site seems to have newer versions.

ADD REPLY • link 3.8 years ago by Sanjar ▴ 150

0

That's a good point. I'll update that now.

We also host a track (https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&c=chr1&g=mane), which is currently version 0.92 but will be updated to the latest 0.95 next week. I don't know the exact number, but 0.95 covers 90-95% of genes (slightly less than the version number suggests).

ADD REPLY • link 3.8 years ago by Luis Nassar ▴ 670

0

Thank you!

ADD REPLY • link 3.8 years ago by Sanjar ▴ 150

0

Thank you, Luis, for an exhaustive answer! This looks great: we don't have to liftOver the hg38 transcripts to hg19. One more question: are the coordinates of the exons in the knownGene table "standardized" across the mentioned consortia too?

ADD REPLY • link 3.8 years ago by Sanjar ▴ 150

1

They are not specifically standardized; that's really what MANE is set to accomplish. Currently, NCBI (RefSeq) and EBI (GENCODE) have a large amount of overlap in their gene models, as you would expect since their goal is the same. However, there are some differences.

knownGene for hg19 closely matches RefSeq but may have some major differences from GENCODE, since it has not been updated in a long time. I have not looked at this closely.

knownGene for hg38 matches GENCODE, and the exon coordinates should be similar to RefSeq, but certainly not standard or the same. For one, the GENCODE comprehensive list has 232K items vs. RefSeq's 173K.

So, in short, I'd expect most hg38 genes to have equivalent exon coordinates (especially well-annotated and well-researched genes), but not all.
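One way to check this for a particular gene is to compare the exon lists of the two transcript models directly. The following is a minimal sketch. It assumes the rows come from genePred-style tables such as knownGene or ncbiRefSeq, where exonStarts and exonEnds are comma-separated strings with a trailing comma. The helper names are made up, and the coordinates in the usage lines are invented for illustration.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end)


def parse_exons(exon_starts: str, exon_ends: str) -> List[Exon]:
    """Turn genePred-style exonStarts/exonEnds strings into (start, end) pairs."""
    starts = [int(x) for x in exon_starts.split(",") if x]
    ends = [int(x) for x in exon_ends.split(",") if x]
    if len(starts) != len(ends):
        raise ValueError("exonStarts and exonEnds have different lengths")
    return list(zip(starts, ends))


def shared_exon_fraction(a: List[Exon], b: List[Exon]) -> float:
    """Fraction of distinct exons (from either model) that appear in both."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


# Invented coordinates: the two models differ only in the start of the last exon.
model_a = parse_exons("100,300,500,", "200,400,600,")
model_b = parse_exons("100,300,520,", "200,400,600,")
print(shared_exon_fraction(model_a, model_b))  # prints 0.5
```

A value of 1.0 means the two models have identical exon coordinates. Anything lower points to transcripts worth inspecting in the browser.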

ADD REPLY • link 3.8 years ago by Luis Nassar ▴ 670