calculating the exome size from latest GTF file
2
0
Entering edit mode
10.6 years ago
Mingkun ▴ 40

I have downloaded the latest GTF file (Homo_sapiens.GRCh37.75.gtf) from Ensembl,I was trying to find all protein coding genes and calculte the exome size, I did as follows:

awk '{if($3=="gene" && $2=="protein_coding"){print $0}}' Homo_sapiens.GRCh37.75.gtf

I found 22810 protein_coding genes, with a total length of 1,395,684,274 bp ( | awk '{sum+=$5-$4}END{print sum}'); 1.3G seems too large for me.

Similarly when I search for all exons

awk '{if($3=="exon" && $2=="protein_coding"){print $0}}' Homo_sapiens.GRCh37.75.gtf

I found 809933 exons, with a total length of 191,357,777 bp

As far as I remember, in a coure I attended before, we have tried R to calculate the total exome size of human, which is ~30M, why my calculation here is so big? did I made some mistakes?

Thanks for your help.

gtf calculation exome • 5.4k views
ADD COMMENT
0
Entering edit mode

I have another naive questions, since intron is not included in the gtf (is it used to?), how I infer the intron regions, regions in the gene but without annotion should be the intron region? am I right? or these are other easy method?

ADD REPLY
0
Entering edit mode

Introns aren't explicitly included since they're just the regions between the exons. You can add them in and then use the same methods to calculate their size. Example scripts are provided here in R (from me) and perl (from Alejandro Reyes).

ADD REPLY
0
Entering edit mode

thanks a lot for your help

ADD REPLY
3
Entering edit mode
10.6 years ago

If a gene has multiple transcripts then you'll end up double/triple/etc. counting some of its exons with that method. In R, you can import the GTF, split() by gene_id and then reduce() that to get non-overlapping exons for this purpose. That'd be a real pain in awk.

ADD COMMENT
0
Entering edit mode

I thought exon won't have this problem (differring from transicript), I will have a look, thanks for your answer.

ADD REPLY
0
Entering edit mode

You are right, I have found this:

1       protein_coding  gene    860260  879955 gene_id "ENSG00000187634"; gene_name "SAMD11"
1       protein_coding  gene    861264  866445 gene_id "ENSG00000268179"; gene_name "AL645608.1";
ADD REPLY
2
Entering edit mode
10.6 years ago
Irsan ★ 7.8k

First merge overlapping features with bedops and then count:

awk '{if($3=="exon" && $2=="protein_coding"){print $0}}' Homo_sapiens.GRCh37.75.gtf | gtf2bed - | bedops -m - | awk 'BEGIN{FS="\t";count=0}{count=count + ($3-$2)}END{print count}'
ADD COMMENT
0
Entering edit mode

Through this method, I got the size of protein_coding/exon equals to 81M, the size of protein_coding/gene equals to 1.3G, are these numbers make sense?

ADD REPLY
0
Entering edit mode
sounds like the right number. you can also import the merged bed file into IGV and see how it overlaps with the annotation
ADD REPLY

Login before adding your answer.

Traffic: 1630 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6