Number of Transcripts in GRCh38
6.8 years ago
boyu93 ▴ 20

I downloaded several canonical GRCh38 GTF files, but the number of transcripts differs between them. I was hoping to find one with exactly 198838 transcripts. Does anyone know where I could find a list of the number of transcripts in each version of the canonical GTF file?

Edit: I have found that the GENCODE website provides a summary of transcript counts (GenCode_summary), and that Ensembl has its own series of annotations. How can I find an annotation file with a specific number of transcripts, namely 198838?

RNA-Seq genome sequencing • 3.5k views

Why are you looking for an annotation with exactly 198838 transcripts?


That's roughly the number of transcripts annotated by GENCODE, but it includes upward of 40,000 pseudogenes and tens of thousands more transcripts that undergo nonsense-mediated decay.

The last time I ran an RNA-seq experiment with Kallisto, I used the GENCODE v24 FASTA as the reference. It had 199,170 transcripts, covering all genes and known transcript isoforms. Here's a direct link to the listing (on my Box.com page), in case you're interested: https://app.box.com/s/bx03lewrwwxagfxcm5qkpfcv7p3gq09y
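If you want to check a count like this yourself, a minimal sketch is to count the header lines in the transcriptome FASTA. The function name `count_fasta_records` and the file path in the usage note are hypothetical, and this assumes one `>` header per transcript, as in a standard transcript FASTA.

```python
import gzip


def count_fasta_records(path):
    """Count sequence records (lines starting with '>') in a FASTA file.

    Transparently handles gzip-compressed files based on the .gz suffix.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

Something like `count_fasta_records("transcripts.fa.gz")` (placeholder path) would then give the number of transcripts the quantifier sees.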


You can find old versions in the Ensembl archives (https://www.ensembl.org/info/website/archives/index.html ). For each archive, the "More information and statistics" link for the human genome gives a detailed breakdown of the number of different features in that particular assembly. An example from Jul 2015 is here: http://jul2015.archive.ensembl.org/Homo_sapiens/Info/Annotation None of the assemblies I quickly checked had exactly 198838 transcripts (some came close, but none matched that number).


Thanks so much for the reply. I just found that there were some problems when processing with Tablemaker: the number of transcripts in the Tablemaker output is not consistent with the reference annotation.

6.8 years ago
Denise CS ★ 5.2k

GENCODE calculates its stats taking into account the reference chromosomes only (check their README_stats.txt for the details), whereas Ensembl provides the stats for reference chromosomes plus alternate sequences (haplotypes and patches).

If you go to the Ensembl annotation page for human, you will see that their latest annotation "also includes 261 alt loci scaffolds, mainly in the LRC/KIR complex on chromosome 19 (35 alternate sequence representations) and the MHC region on chromosome 6 (7 alternate sequence representations)".

Your GENCODE v24 is on GRCh38.p5 (the 5th patched version of GRCh38), and the Ensembl annotation on GRCh38.p5 is available in their release 84. Ensembl reports 199,184 transcripts; the higher number is because of the transcripts annotated on the patches.

If you download Homo_sapiens.GRCh38.84.gtf.gz from the Ensembl FTP site for release 84, you should be able to get the same numbers, as that .gtf contains the annotation on the reference chromosomes, without patches and haplotypes.
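To see where the transcripts in a given GTF sit, a rough sketch is to tally `transcript` features per sequence name and then separate primary chromosomes from everything else. The function names are hypothetical, and the definition of "primary" here (1-22, X, Y, MT/M, with or without a `chr` prefix) is my assumption for illustration; it may not match exactly what GENCODE or Ensembl count as reference sequences, so compare against their README before relying on it.

```python
import gzip
from collections import Counter

# Assumed set of primary chromosome names (Ensembl style, after stripping "chr").
PRIMARY = {str(n) for n in range(1, 23)} | {"X", "Y", "MT", "M"}


def is_primary_chromosome(seqname):
    """Return True for names like '1', 'chr1', 'X', 'chrM' (assumed convention)."""
    name = seqname[3:] if seqname.startswith("chr") else seqname
    return name in PRIMARY


def count_transcripts_by_seqname(gtf_path):
    """Count GTF lines whose feature type (column 3) is 'transcript', per seqname."""
    opener = gzip.open if gtf_path.endswith(".gz") else open
    counts = Counter()
    with opener(gtf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):  # header/comment lines
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 9 and fields[2] == "transcript":
                counts[fields[0]] += 1
    return counts
```

With `counts = count_transcripts_by_seqname(path)`, the total is `sum(counts.values())` and the primary-only figure is `sum(n for name, n in counts.items() if is_primary_chromosome(name))`; a difference between the two points to transcripts on patches, haplotypes, or scaffolds.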


Thank you so much! I found the information I needed, and I traced the problem back to the annotation file used while running Cufflinks. It seems that a mistake during either Cufflinks or Tablemaker processing caused duplicated transcript names in a file (processing with StringTie does not have this issue). After filtering out these duplicated transcripts, the numbers match the reference.
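For anyone hitting the same thing, here is a minimal sketch of the duplicate check, assuming the transcript names have already been read into a Python list (how you extract them depends on the output file, which I am not assuming anything about). Both function names are hypothetical.

```python
from collections import Counter


def find_duplicates(names):
    """Return {name: occurrences} for every name seen more than once."""
    counts = Counter(names)
    return {name: n for name, n in counts.items() if n > 1}


def unique_in_order(names):
    """Drop repeated names, keeping the first occurrence and the original order."""
    return list(dict.fromkeys(names))
```

Comparing `len(unique_in_order(names))` with the transcript count of the reference annotation is a quick way to confirm that duplicates account for the whole discrepancy.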


Just looking for a small cross-verification. The number of transcripts is 199184; what is the number of genes in this build? Is it 60675? So after read assignment with featureCounts/StringTie, should I expect to get counts for 60675 genes?
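One way to cross-check both figures directly against the GTF being used is to collect the distinct `gene_id` and `transcript_id` values from the attribute column. This is only a sketch: the function name is hypothetical, and it assumes the usual GTF attribute syntax of `key "value";` pairs in column 9.

```python
import gzip
import re

# \b keeps the match from starting inside a longer attribute key.
GENE_ID = re.compile(r'\bgene_id "([^"]+)"')
TRANSCRIPT_ID = re.compile(r'\btranscript_id "([^"]+)"')


def count_genes_and_transcripts(gtf_path):
    """Return (distinct gene_id count, distinct transcript_id count) for a GTF."""
    opener = gzip.open if gtf_path.endswith(".gz") else open
    genes, transcripts = set(), set()
    with opener(gtf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):  # header/comment lines
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 9:
                continue
            gene = GENE_ID.search(fields[8])
            if gene:
                genes.add(gene.group(1))
            transcript = TRANSCRIPT_ID.search(fields[8])
            if transcript:
                transcripts.add(transcript.group(1))
    return len(genes), len(transcripts)
```

The gene count from the same GTF that is passed to the counting tool is the number to compare its output against, since different releases and patch levels give different totals.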
