Hi!
You could try using the gffutils Python library as an alternative to the AGAT toolkit for extracting summary statistics from GFF3/GTF files. gffutils is a flexible and efficient library for working with GFF and GTF files in a variety of formats.
Here's an example of how to use gffutils to extract summary statistics from a GFF3 file:
import gffutils
db = gffutils.create_db("input.gff3", dbfn="input.db", force=True, keep_order=True, merge_strategy="merge", sort_attribute_values=True)
gene_count = db.count_features_of_type("gene")
exon_count = db.count_features_of_type("exon")
mRNA_count = db.count_features_of_type("mRNA")
print("Number of genes:", gene_count)
print("Number of exons:", exon_count)
print("Number of mRNAs:", mRNA_count)
This code snippet creates a database from the input GFF3 file, counts the number of genes, exons, and mRNAs, and prints the results. You can modify this example to extract other summary statistics as needed.
To calculate gene_lengths maybe try:
db = gffutils.create_db("input.gff3", dbfn="input.db", force=True, keep_order=True, merge_strategy="merge", sort_attribute_values=True)
gene_lengths = {}
for gene in db.features_of_type("gene"):
gene_length = 0
for exon in db.children(gene, featuretype="exon"):
exon_length = exon.end - exon.start + 1
gene_length += exon_length
gene_lengths[gene.id] = gene_length
for gene_id, length in gene_lengths.items():
print(f"{gene_id}: {length}")
DISCLAIMER: I'm using my chatbot here (https://tinybio.cloud) to help generate this answer. This answer has not been tested and may be incorrect. You can download it from the website.
Does not sound crazy to me, the longest human gene is 2 300 000 bp (dystrophin) You can try the basic statistics script (_sq_) to be sure there is no wrong modification made by AGAT with the script you mentioned (_sp_)
You can also easily check that using an awk command
I am going to tag developer of AGAT: Juke34
GenoMax Thanks. Juke34 . any suggestion on why agat_sp_statistics.pl is producing a inflated maximum gene length number?