I was wondering whether it is possible to recover the counts of known genes, novel genes, exons, etc. reported on the Ensembl species information page by parsing the GTF file.
For example, for Mus musculus, the Ensembl page (http://useast.ensembl.org/Mus_musculus/Info/StatsTable?db=core) shows the following:
Gene counts
Known genes: 21,886
Novel genes: 531
Putative genes: 290
Pseudogenes: 5,482
RNA genes: 7,541
Immunoglobulin/T-cell receptor gene segments: 481
Gene exons: 416,230
Gene transcripts: 97,639
I downloaded the GTF gene annotation file from ftp://ftp.ensembl.org/pub/release-67/gtf/mus_musculus/
The version matches the one on the web page.
I was able to correctly recover the number of gene transcripts:
$ cat Mus_musculus.NCBIM37.67.gtf | awk '{print $12}' | tr -d ';"' | sort | uniq | wc -l
97639
But that is all. I could not reproduce any of the other reported counts, such as known genes, novel genes, or putative genes.
To count the exons, I tried the following command:
$ cat Mus_musculus.NCBIM37.67.gtf | awk '$3~/exon/ {print $3}' | wc -l
689492
which is an overestimate (the page reports 416,230).
I also tried to count the genes with the following command:
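One way to test whether repeated exons explain part of the gap is sketched below. It rests on an assumption that is not confirmed here: that exon lines sharing the same chromosome, start, end, and strand describe one exon that is listed once per transcript. Ensembl may define a distinct exon differently, and the function name count_unique_exons is hypothetical.

```shell
# Count distinct exons by location rather than by line.
# Assumption: exon lines with the same chromosome, start, end and strand
# are the same exon repeated for several transcripts.
count_unique_exons() {
    # $1: path to a tab-separated GTF file
    awk -F'\t' '
        # keep only exon features and count each location once
        $3 == "exon" && !seen[$1, $4, $5, $7]++ { n++ }
        END { print n + 0 }
    ' "$1"
}
```

It could be run as count_unique_exons Mus_musculus.NCBIM37.67.gtf and the result compared with the 416,230 exons shown on the page.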
$ cat Mus_musculus.NCBIM37.67.gtf | awk '{print $10}' | tr -d ';"' | sort | uniq | wc -l
37991
which is also an overestimate. Even if you add up known genes, novel genes, putative genes, pseudogenes, RNA genes, and gene segments, you get 36211, which differs from the 37991 obtained by parsing.
What am I doing wrong? Thank you in advance.
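To see where the gene total comes from, the unique gene IDs can be broken down by the label in column 2. This is a sketch under an assumption: that in this release column 2 holds a biotype-like label (for example protein_coding or pseudogene) rather than a generic source name. The function name genes_per_label is hypothetical, and how these labels map onto the categories on the Ensembl page is not established here.

```shell
# Tally distinct gene IDs per label found in column 2.
# Assumption: column 2 carries a biotype-like label in this release.
genes_per_label() {
    # $1: path to a tab-separated GTF file
    awk -F'\t' '
        match($9, /gene_id "[^"]+"/) {
            # skip the 9-character prefix and the closing quote to get the ID
            id = substr($9, RSTART + 9, RLENGTH - 10)
            # count each (label, gene) pair only once
            if (!seen[$2 SUBSEP id]++) count[$2]++
        }
        END { for (label in count) print count[label] "\t" label }
    ' "$1" | sort -k2,2
}
```

A gene whose lines carry more than one label is counted under each of them, so the per-label counts can add up to more than the number of unique gene IDs.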