Hello! I have a GFF3 file produced by AUGUSTUS and need to build a table with the names of the predicted genes, their sizes, and the number of introns and exons for each gene. I've run into a problem because the file contains comment lines throughout. I have about 67,000 genes, so it would take too long to do this one by one. Does anyone have an idea of what I should do?
You can get this kind of result (and even more if the FASTA file is provided too):
Number of genes 27707
Number of mrnas 27707
Number of mrnas with utr both sides 9985
Number of mrnas with at least one utr 20693
Number of cdss 27707
Number of exons 131919
Number of five_prime_utrs 15301
Number of three_prime_utrs 15377
Number of exon in cds 119534
Number of exon in five_prime_utr 21204
Number of exon in three_prime_utr 21696
Number of intron in cds 91827
Number of intron in exon 104212
Number of intron in five_prime_utr 5903
Number of intron in three_prime_utr 6319
Number of single exon gene 1232
Number of single exon mrna 1232
mean mrnas per gene 1.0
mean cdss per mrna 1.0
mean exons per mrna 4.8
mean five_prime_utrs per mrna 0.6
mean three_prime_utrs per mrna 0.6
mean exons per cds 4.3
mean exons per five_prime_utr 1.4
mean exons per three_prime_utr 1.4
mean introns in cdss per mrna 3.3
mean introns in exons per mrna 3.8
mean introns in five_prime_utrs per mrna 0.2
mean introns in three_prime_utrs per mrna 0.2
Total gene length 346693759
Total mrna length 334573649
Total cds length 25184373
Total exon length 42796985
Total five_prime_utr length 3907368
Total three_prime_utr length 13705244
Total intron length per cds 270026348
Total intron length per exon 291880876
Total intron length per five_prime_utr 11456694
Total intron length per three_prime_utr 10085428
mean gene length 12512
mean mrna length 12075
mean cds length 908
mean exon length 324
mean five_prime_utr length 255
mean three_prime_utr length 891
mean cds piece length 210
mean five_prime_utr piece length 184
mean three_prime_utr piece length 631
mean intron in cds length 2940
mean intron in exon length 2800
mean intron in five_prime_utr length 1940
mean intron in three_prime_utr length 1596
Longest genes 330825
Longest mrnas 330825
Longest cdss 49575
Longest exons 26237
Longest five_prime_utrs 8910
Longest three_prime_utrs 22461
Longest cds piece 26237
Longest five_prime_utr piece 8273
Longest three_prime_utr piece 22461
Longest intron into cds part 189721
Longest intron into exon part 189721
Longest intron into five_prime_utr part 37945
Longest intron into three_prime_utr part 102332
Shortest genes 6
Shortest mrnas 6
Shortest cdss 6
Shortest exons 1
Shortest five_prime_utrs 1
Shortest three_prime_utrs 1
Shortest intron into cds part 5
Shortest intron into exon part 5
Shortest intron into five_prime_utr part 21
Shortest intron into three_prime_utr part 21
You can get some basic information with shell commands. For example, to count the genes you could run grep -c $'\tgene\t' augustus.gff3, replacing augustus.gff3 with your actual filename. To get total exon or intron counts, just change the feature type in the command from gene to exon or intron.
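If you would rather get all the feature counts in one pass, here is a minimal Python sketch of the same idea. It assumes a standard tab-delimited GFF3 with the feature type in column 3, and it skips every line starting with # (which is how the AUGUSTUS comments appear). The function name count_feature_types is just something I made up for illustration.

```python
from collections import Counter


def count_feature_types(lines):
    """Count GFF3 feature types (column 3), ignoring comments and blank lines."""
    counts = Counter()
    for line in lines:
        # AUGUSTUS comments and GFF3 directives all start with '#'
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # A valid GFF3 feature row has 9 tab-separated columns
        if len(cols) >= 9:
            counts[cols[2]] += 1
    return counts
```

You would call it on an open file handle, e.g. with open("augustus.gff3") as handle: print(count_feature_types(handle)).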
However, what you're asking for is a little more involved. It's not terribly complicated, but will require a bit of programming in a language like Python, Ruby, or Perl.
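To give an idea of what that programming might look like, here is a rough Python sketch, not a tested solution for your exact file. It makes several assumptions you should check against your data: gene rows carry an ID attribute, transcript rows are typed mRNA or transcript and carry ID and Parent, exon and intron rows carry a Parent pointing at the transcript, gene "size" means the genomic span (end - start + 1), and counts are summed over all transcripts of a gene. The function names are hypothetical.

```python
import csv
from collections import defaultdict


def parse_attributes(field):
    """Turn 'ID=g1.t1;Parent=g1' into {'ID': 'g1.t1', 'Parent': 'g1'}."""
    attrs = {}
    for part in field.strip().split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def gene_table(lines):
    """Return rows (gene_id, size_bp, n_exons, n_introns) in file order."""
    gene_size = {}        # gene ID -> genomic span in bp
    transcript_gene = {}  # transcript ID -> gene ID
    counts = defaultdict(lambda: {"exon": 0, "intron": 0})  # keyed by Parent
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip comment lines and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete feature row
        ftype, start, end = cols[2], int(cols[3]), int(cols[4])
        attrs = parse_attributes(cols[8])
        if ftype == "gene" and "ID" in attrs:
            gene_size[attrs["ID"]] = end - start + 1
        elif ftype in ("mRNA", "transcript") and "ID" in attrs and "Parent" in attrs:
            transcript_gene[attrs["ID"]] = attrs["Parent"]
        elif ftype in ("exon", "intron") and "Parent" in attrs:
            counts[attrs["Parent"]][ftype] += 1
    # Roll the per-transcript counts up to their genes
    totals = {gene: {"exon": 0, "intron": 0} for gene in gene_size}
    for parent, c in counts.items():
        gene = transcript_gene.get(parent, parent)
        if gene in totals:
            totals[gene]["exon"] += c["exon"]
            totals[gene]["intron"] += c["intron"]
    return [(g, gene_size[g], totals[g]["exon"], totals[g]["intron"])
            for g in gene_size]


def write_gene_table(gff_path, out_path):
    """Write the per-gene table as a tab-separated file with a header row."""
    with open(gff_path) as gff, open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["gene", "size_bp", "exons", "introns"])
        writer.writerows(gene_table(gff))
```

Calling write_gene_table("augustus.gff3", "genes.tsv") would then produce one row per gene. If your file has no exon rows (only CDS rows, for instance), change the feature types in the script accordingly.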
I never managed to get it working properly. First, the installation protocol doesn't work as expected, at least on OS X.
Then the output is wrong: I used a GTF and a GFF file from Ensembl and got only one sequence, without a name...
Thanks Daniel, I'll try these. =)