Question

calculating the exome size from latest GTF file

0

Entering edit mode

10.6 years ago

Mingkun ▴ 40

I have downloaded the latest GTF file (Homo_sapiens.GRCh37.75.gtf) from Ensembl，I was trying to find all protein coding genes and calculte the exome size, I did as follows:

awk '{if($3=="gene" && $2=="protein_coding"){print $0}}' Homo_sapiens.GRCh37.75.gtf

I found 22810 protein_coding genes, with a total length of 1,395,684,274 bp ( | awk '{sum+=$5-$4}END{print sum}'); 1.3G seems too large for me.

Similarly when I search for all exons

awk '{if($3=="exon" && $2=="protein_coding"){print $0}}' Homo_sapiens.GRCh37.75.gtf

I found 809933 exons, with a total length of 191,357,777 bp

As far as I remember, in a coure I attended before, we have tried R to calculate the total exome size of human, which is ~30M, why my calculation here is so big? did I made some mistakes?

Thanks for your help.

gtf calculation exome • 5.4k views

ADD COMMENT • link updated 3.2 years ago by Ram 44k • written 10.6 years ago by Mingkun ▴ 40

0

Entering edit mode

I have another naive questions, since intron is not included in the gtf (is it used to?), how I infer the intron regions, regions in the gene but without annotion should be the intron region? am I right? or these are other easy method?

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by Mingkun ▴ 40

0

Entering edit mode

Introns aren't explicitly included since they're just the regions between the exons. You can add them in and then use the same methods to calculate their size. Example scripts are provided here in R (from me) and perl (from Alejandro Reyes).

ADD REPLY • link 10.6 years ago by Devon Ryan 104k

0

Entering edit mode

thanks a lot for your help

ADD REPLY • link 10.6 years ago by Mingkun ▴ 40

2

Entering edit mode

10.6 years ago

Irsan ★ 7.8k

First merge overlapping features with bedops and then count:

awk '{if($3=="exon" && $2=="protein_coding"){print $0}}' Homo_sapiens.GRCh37.75.gtf | gtf2bed - | bedops -m - | awk 'BEGIN{FS="\t";count=0}{count=count + ($3-$2)}END{print count}'

ADD COMMENT • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by Irsan ★ 7.8k

0

Entering edit mode

Through this method, I got the size of protein_coding/exon equals to 81M, the size of protein_coding/gene equals to 1.3G, are these numbers make sense?

ADD REPLY • link 10.6 years ago by Mingkun ▴ 40

0

Entering edit mode

sounds like the right number. you can also import the merged bed file into IGV and see how it overlaps with the annotation

ADD REPLY • link 10.6 years ago by Irsan ★ 7.8k

Ram · Accepted Answer · 2014-05-09

3

Entering edit mode

10.6 years ago

Devon Ryan 104k

If a gene has multiple transcripts then you'll end up double/triple/etc. counting some of its exons with that method. In R, you can import the GTF, split() by gene_id and then reduce() that to get non-overlapping exons for this purpose. That'd be a real pain in awk.

ADD COMMENT • link 10.6 years ago by Devon Ryan 104k

0

Entering edit mode

I thought exon won't have this problem (differring from transicript), I will have a look, thanks for your answer.

ADD REPLY • link 10.6 years ago by Mingkun ▴ 40

0

Entering edit mode

You are right, I have found this:

1       protein_coding  gene    860260  879955 gene_id "ENSG00000187634"; gene_name "SAMD11"
1       protein_coding  gene    861264  866445 gene_id "ENSG00000268179"; gene_name "AL645608.1";

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by Mingkun ▴ 40