I downloaded several canonical GRCh38 GTF files, but the number of transcripts differs between them. I was hoping to find one with exactly 198838 transcripts. Does anyone know where I could find a list showing the number of transcripts in each version of the canonical GTF file?
Edit: I have found that the GENCODE website has a summary of transcript counts per release (GenCode_summary). Ensembl also provides its own series of annotations. How can I find an annotation file with a specific number of transcripts, namely 198838?
Why are you looking for an annotation with exactly 198838 transcripts?
That's roughly the number of transcripts annotated by GENCODE, but that total includes upward of 40,000 pseudogenes and many tens of thousands of transcripts that undergo nonsense-mediated decay.
The last time I ran an RNA-seq experiment with Kallisto, I used the GENCODE v24 FASTA as the reference, and it had 199,170 transcripts, which includes all genes and known transcript isoforms. Here's a direct link to the listing (on my Box page), in case you're interested: https://app.box.com/s/bx03lewrwwxagfxcm5qkpfcv7p3gq09y
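If you want to check a count like this yourself, here is a minimal sketch that counts the records in a transcript FASTA by counting header lines (those starting with `>`). The function name `count_fasta_records` and the file name in the usage comment are hypothetical, and the sketch assumes one header line per transcript.

```python
import gzip


def count_fasta_records(path):
    """Count FASTA records by counting header lines that start with '>'."""
    # Assumption: files ending in .gz are gzip-compressed, everything else is plain text.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Usage (hypothetical file name):
# print(count_fasta_records("transcripts.fa.gz"))
```

Comparing this number with the transcript count of the GTF you use downstream is a quick way to see whether the two files come from the same release.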
In the Ensembl archives (https://www.ensembl.org/info/website/archives/index.html) you can find old versions. For each one, the "More information and statistics" link for the human genome gives a detailed breakdown of the number of different features in that release. An example from Jul 2015 is here: http://jul2015.archive.ensembl.org/Homo_sapiens/Info/Annotation None of the releases I quickly checked had exactly 198838 transcripts (some came close, but none matched that number).
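To compare downloaded files against those published statistics, a rough sketch like the following counts the distinct `transcript_id` values in a GTF. It assumes the usual GTF layout (nine tab-separated columns, with attributes such as `transcript_id "..."` in the ninth); the function name `count_transcripts` and the file name in the usage comment are hypothetical.

```python
import gzip
import re

# Matches the transcript_id attribute in the ninth GTF column.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def count_transcripts(gtf_path):
    """Return the number of distinct transcript_id values in a GTF file."""
    # Assumption: files ending in .gz are gzip-compressed, everything else is plain text.
    opener = gzip.open if str(gtf_path).endswith(".gz") else open
    transcript_ids = set()
    with opener(gtf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip header/comment lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 9:
                continue  # skip malformed or empty lines
            match = TRANSCRIPT_ID.search(fields[8])
            if match:
                # Gene-level lines carry no transcript_id and are ignored here.
                transcript_ids.add(match.group(1))
    return len(transcript_ids)


# Usage (hypothetical file name):
# print(count_transcripts("annotation.gtf.gz"))
```

Counting distinct IDs rather than lines avoids counting a transcript once per exon or CDS row. Running this over each candidate file shows directly whether any of them contains exactly 198838 transcripts.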
Thanks so much for the reply. I just found that there were some problems when processing with Tablemaker: the number of transcripts in the Tablemaker output is not consistent with the reference annotation.