Analysis gff3 file
9.4 years ago

Hello! I have a GFF3 file produced by AUGUSTUS and need to build a table listing the name of each predicted gene, its size, and its number of introns and exons. My problem is that the file contains comments throughout its entire length. I have about 67,000 genes, so it would take far too long to do this one by one. Does anyone have an idea of what I should do?

genome gff3 annotation • 7.9k views
6.1 years ago
Juke34 8.9k

The question is about per-gene statistics, so my answer may not be directly relevant, but here is a solution for global statistics:

GAG is really good for that; it is available on GitHub and described in a publication.
But as it was not exhaustive enough for my purpose, I wrote my own script, gff3_sp_statistics.pl, which required a proper installation of the GAAS repository. It is now available as agat_sp_statistics.pl in AGAT.

You can get this kind of result (and more if the FASTA file is provided too):

Number of genes                              27707
Number of mrnas                              27707
Number of mrnas with utr both sides          9985
Number of mrnas with at least one utr        20693
Number of cdss                               27707
Number of exons                              131919
Number of five_prime_utrs                    15301
Number of three_prime_utrs                   15377
Number of exon in cds                        119534
Number of exon in five_prime_utr             21204
Number of exon in three_prime_utr            21696
Number of intron in cds                      91827
Number of intron in exon                     104212
Number of intron in five_prime_utr           5903
Number of intron in three_prime_utr          6319
Number of single exon gene                   1232
Number of single exon mrna                   1232
mean mrnas per gene                          1.0
mean cdss per mrna                           1.0
mean exons per mrna                          4.8
mean five_prime_utrs per mrna                0.6
mean three_prime_utrs per mrna               0.6
mean exons per cds                           4.3
mean exons per five_prime_utr                1.4
mean exons per three_prime_utr               1.4
mean introns in cdss per mrna                3.3
mean introns in exons per mrna               3.8
mean introns in five_prime_utrs per mrna     0.2
mean introns in three_prime_utrs per mrna    0.2
Total gene length                            346693759
Total mrna length                            334573649
Total cds length                             25184373
Total exon length                            42796985
Total five_prime_utr length                  3907368
Total three_prime_utr length                 13705244
Total intron length per cds                  270026348
Total intron length per exon                 291880876
Total intron length per five_prime_utr       11456694
Total intron length per three_prime_utr      10085428
mean gene length                             12512
mean mrna length                             12075
mean cds length                              908
mean exon length                             324
mean five_prime_utr length                   255
mean three_prime_utr length                  891
mean cds piece length                        210
mean five_prime_utr piece length             184
mean three_prime_utr piece length            631
mean intron in cds length                    2940
mean intron in exon length                   2800
mean intron in five_prime_utr length         1940
mean intron in three_prime_utr length        1596
Longest genes                                330825
Longest mrnas                                330825
Longest cdss                                 49575
Longest exons                                26237
Longest five_prime_utrs                      8910
Longest three_prime_utrs                     22461
Longest cds piece                            26237
Longest five_prime_utr piece                 8273
Longest three_prime_utr piece                22461
Longest intron into cds part                 189721
Longest intron into exon part                189721
Longest intron into five_prime_utr part      37945
Longest intron into three_prime_utr part     102332
Shortest genes                               6
Shortest mrnas                               6
Shortest cdss                                6
Shortest exons                               1
Shortest five_prime_utrs                     1
Shortest three_prime_utrs                    1
Shortest intron into cds part                5
Shortest intron into exon part               5
Shortest intron into five_prime_utr part     21
Shortest intron into three_prime_utr part    21

...

9.4 years ago

You can get some basic information with shell commands. For example, if you wanted the number of genes you could run grep -c $'\tgene\t' augustus.gff3, of course replacing augustus.gff3 with your actual filename. If you want to get total exon or intron counts, just change the feature type in your command from gene to exon or intron.
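If you would rather get every feature count in one pass, the same idea can be sketched in Python. This is only a minimal illustration: the function name `count_feature_types` and the example records are made up here, and it assumes a tab-delimited GFF3 file in which comment lines start with `#`.

```python
from collections import Counter


def count_feature_types(lines):
    """Count GFF3 feature types (column 3), skipping comments and blank lines."""
    counts = Counter()
    for line in lines:
        # AUGUSTUS comments and GFF3 directives both start with '#'
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete nine-column GFF3 record
        counts[fields[2]] += 1
    return counts


# Hypothetical records, for illustration only
example = [
    "##gff-version 3",
    "# start gene g1",
    "chr1\tAUGUSTUS\tgene\t100\t900\t.\t+\t.\tID=g1",
    "chr1\tAUGUSTUS\tmRNA\t100\t900\t.\t+\t.\tID=g1.t1;Parent=g1",
    "chr1\tAUGUSTUS\texon\t100\t300\t.\t+\t.\tParent=g1.t1",
    "chr1\tAUGUSTUS\tintron\t301\t499\t.\t+\t.\tParent=g1.t1",
    "chr1\tAUGUSTUS\texon\t500\t900\t.\t+\t.\tParent=g1.t1",
    "# end gene g1",
]

counts = count_feature_types(example)
for feature_type, n in sorted(counts.items()):
    print(feature_type, n, sep="\t")
```

For a real file you would pass an open file handle instead of the list, for example `count_feature_types(open("augustus.gff3"))`.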

However, what you're asking for is a little more involved. It's not terribly complicated, but will require a bit of programming in a language like Python, Ruby, or Perl.
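As a rough starting point, here is a minimal Python sketch of such a script. It makes several assumptions that you should check against your own file: comment lines start with `#`, genes have an `ID` attribute, transcripts are typed `mRNA` or `transcript` and point to their gene through `Parent`, and exons point to their transcript through `Parent`. The names `parse_attributes` and `per_transcript_table` are made up for this example. It prints one row per transcript (the same as one row per gene when each gene has a single transcript), and it derives the intron count as exons minus one rather than reading intron features.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Turn 'ID=g1.t1;Parent=g1' into {'ID': 'g1.t1', 'Parent': 'g1'}."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def per_transcript_table(lines, exon_type="exon"):
    """Return rows of (gene, transcript, gene length, exon count, intron count)."""
    gene_length = {}
    transcript_gene = {}
    exon_count = defaultdict(int)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip the comments interleaved with the records
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        ftype, start, end = fields[2], int(fields[3]), int(fields[4])
        attrs = parse_attributes(fields[8])
        if ftype == "gene" and "ID" in attrs:
            gene_length[attrs["ID"]] = end - start + 1  # coordinates are inclusive
        elif ftype in ("mRNA", "transcript") and "ID" in attrs:
            transcript_gene[attrs["ID"]] = attrs.get("Parent")
        elif ftype == exon_type and "Parent" in attrs:
            for parent in attrs["Parent"].split(","):
                exon_count[parent] += 1
    rows = []
    for transcript_id, gene_id in transcript_gene.items():
        n_exons = exon_count[transcript_id]
        # Introns are inferred as the gaps between consecutive exons
        rows.append((gene_id, transcript_id, gene_length.get(gene_id),
                     n_exons, max(n_exons - 1, 0)))
    return rows


# Hypothetical records, for illustration only
example = [
    "##gff-version 3",
    "# start gene g1",
    "chr1\tAUGUSTUS\tgene\t100\t900\t.\t+\t.\tID=g1",
    "chr1\tAUGUSTUS\tmRNA\t100\t900\t.\t+\t.\tID=g1.t1;Parent=g1",
    "chr1\tAUGUSTUS\texon\t100\t300\t.\t+\t.\tParent=g1.t1",
    "chr1\tAUGUSTUS\texon\t500\t900\t.\t+\t.\tParent=g1.t1",
    "# end gene g1",
    "chr1\tAUGUSTUS\tgene\t2000\t2300\t.\t-\t.\tID=g2",
    "chr1\tAUGUSTUS\tmRNA\t2000\t2300\t.\t-\t.\tID=g2.t1;Parent=g2",
    "chr1\tAUGUSTUS\texon\t2000\t2300\t.\t-\t.\tParent=g2.t1",
]

rows = per_transcript_table(example)
print("gene", "transcript", "gene_length", "exons", "introns", sep="\t")
for row in rows:
    print(*row, sep="\t")
```

If your AUGUSTUS output has no `exon` lines, passing `exon_type="CDS"` counts coding segments instead, which is not quite the same thing when UTRs are annotated.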


Thanks Daniel, I'll try these. =)

9.4 years ago

Take a look at GFF-Ex to see whether it could be useful for you. It is a standalone genomic feature extraction package.


I never managed to get it to work properly. First, the installation protocol doesn't work as expected, at least on OSX. Then the output was wrong: I used a GTF and a GFF file from Ensembl and got only one sequence, without a name...

6.1 years ago
Vitis ★ 2.6k

See this related question:

Plot statistics from gtf/gff file
