I have downloaded the latest GTF file (Homo_sapiens.GRCh37.75.gtf) from Ensembl. I was trying to find all protein-coding genes and calculate the exome size, so I did the following:
awk '{if($3=="gene" && $2=="protein_coding"){print $0}}' Homo_sapiens.GRCh37.75.gtf
I found 22810 protein_coding genes, with a total length of 1,395,684,274 bp (summed by appending | awk '{sum+=$5-$4}END{print sum}' to the command above); 1.3 Gb seems too large to me.
Similarly, when I search for all exons:
awk '{if($3=="exon" && $2=="protein_coding"){print $0}}' Homo_sapiens.GRCh37.75.gtf
I found 809933 exons, with a total length of 191,357,777 bp.
As far as I remember, in a course I attended before, we used R to calculate the total size of the human exome, which was ~30 Mb. Why is my calculation here so much bigger? Did I make a mistake?
Thanks for your help.
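One likely reason for both large numbers, offered as a sketch rather than a definitive diagnosis: summing "gene" features counts introns as well as exons, and summing "exon" features counts the same exon once for every transcript that contains it. GTF coordinates are also 1-based and inclusive, so the length of a feature is end - start + 1, not end - start. The Python sketch below (Python is used here for readability; it is not the awk approach from the question) merges overlapping exon intervals per chromosome before summing. The function name merged_exon_length and the demo records are made up for illustration, and the filter on column 2 simply mirrors the awk command above, assuming that column holds the biotype in this file.

```python
from collections import defaultdict


def merged_exon_length(gtf_lines, source="protein_coding"):
    """Return the total length of the union of exon intervals.

    Assumes tab-separated GTF records with 1-based inclusive coordinates
    and the biotype in column 2 (as the awk filter in the question does).
    """
    intervals = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip malformed records
        if fields[2] == "exon" and fields[1] == source:
            intervals[fields[0]].append((int(fields[3]), int(fields[4])))

    total = 0
    for ivs in intervals.values():
        ivs.sort()
        cur_start, cur_end = ivs[0]
        for start, end in ivs[1:]:
            if start <= cur_end + 1:
                # Overlapping or directly adjacent: extend the current block.
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
    return total


# Hypothetical records: two transcripts sharing and overlapping exons.
demo = [
    '1\tprotein_coding\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    '1\tprotein_coding\texon\t150\t250\t.\t+\t.\tgene_id "G1"; transcript_id "T2";',
    '1\tprotein_coding\texon\t300\t400\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    '1\tprotein_coding\texon\t300\t400\t.\t+\t.\tgene_id "G1"; transcript_id "T2";',
    '2\tlincRNA\texon\t1\t1000\t.\t-\t.\tgene_id "G2"; transcript_id "T3";',
]
print(merged_exon_length(demo))  # 252
```

To run it on the real file, pass an open file handle, for example merged_exon_length(open("Homo_sapiens.GRCh37.75.gtf")). Note that exon features include UTRs, so the merged total may still come out above a coding-sequence-only figure.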
I have another naive question. Since introns are not included in the GTF (did they use to be?), how do I infer the intron regions? Should the regions inside a gene that have no annotation be treated as introns? Am I right, or is there an easier method?
Introns aren't explicitly included since they're just the regions between the exons. You can add them in and then use the same methods to calculate their size. Example scripts are provided here in R (from me) and perl (from Alejandro Reyes).
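As a rough illustration of "the regions between the exons", here is a minimal Python sketch that groups exon records by transcript_id and reports the gaps between consecutive exons. It is not the R or perl script mentioned above; the function name introns_by_transcript and the demo records are hypothetical, and it assumes each record carries a transcript_id "..." attribute in column 9 and that exons within one transcript do not overlap.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def introns_by_transcript(gtf_lines):
    """Map each transcript ID to a list of (chrom, start, end) introns.

    Coordinates are 1-based and inclusive, matching GTF conventions.
    """
    exons = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match is None:
            continue  # no transcript to assign this exon to
        key = (fields[0], match.group(1))
        exons[key].append((int(fields[3]), int(fields[4])))

    introns = {}
    for (chrom, transcript), ivs in exons.items():
        ivs.sort()
        gaps = []
        # An intron is the gap between one exon's end and the next exon's start.
        for (_, prev_end), (next_start, _) in zip(ivs, ivs[1:]):
            if next_start > prev_end + 1:
                gaps.append((chrom, prev_end + 1, next_start - 1))
        introns[transcript] = gaps
    return introns


# Hypothetical three-exon transcript.
demo = [
    '1\tprotein_coding\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    '1\tprotein_coding\texon\t300\t400\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    '1\tprotein_coding\texon\t500\t600\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
]
print(introns_by_transcript(demo))
```

Because the introns are derived per transcript, a region that is intronic in one transcript can be exonic in another transcript of the same gene, so the results need merging or subtracting against the exon set before computing a genome-wide intron size.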
Thanks a lot for your help.