Hi!
You could try using the gffutils Python library as an alternative to the AGAT toolkit for extracting summary statistics from GFF3/GTF files. gffutils is a flexible and efficient library for working with GFF and GTF files in a variety of formats.
Here's an example of how to use gffutils to extract summary statistics from a GFF3 file:
import gffutils
# Create a database from the GFF3 file
db = gffutils.create_db("input.gff3", dbfn="input.db", force=True, keep_order=True, merge_strategy="merge", sort_attribute_values=True)
# Extract summary statistics
gene_count = db.count_features_of_type("gene")
exon_count = db.count_features_of_type("exon")
mRNA_count = db.count_features_of_type("mRNA")
print("Number of genes:", gene_count)
print("Number of exons:", exon_count)
print("Number of mRNAs:", mRNA_count)
This code snippet creates a database from the input GFF3 file, counts the number of genes, exons, and mRNAs, and prints the results. You can modify this example to extract other summary statistics as needed.
To calculate gene_lengths maybe try:
# Create a database from the GFF3/GTF file
db = gffutils.create_db("input.gff3", dbfn="input.db", force=True, keep_order=True, merge_strategy="merge", sort_attribute_values=True)
# Calculate gene lengths
gene_lengths = {}
for gene in db.features_of_type("gene"):
gene_length = 0
for exon in db.children(gene, featuretype="exon"):
exon_length = exon.end - exon.start + 1
gene_length += exon_length
gene_lengths[gene.id] = gene_length
# Print gene lengths
for gene_id, length in gene_lengths.items():
print(f"{gene_id}: {length}")
DISCLAIMER: I'm using my chatbot here (https://tinybio.cloud) to help generate this answer. This answer has not been tested and may be incorrect. You can download it from the website.
Does not sound crazy to me, the longest human gene is 2 300 000 bp (dystrophin) You can try the basic statistics script (_sq_) to be sure there is no wrong modification made by AGAT with the script you mentioned (_sp_)
You can also easily check that using an awk command
I am going to tag developer of AGAT: Juke34
GenoMax Thanks. Juke34 . any suggestion on why agat_sp_statistics.pl is producing a inflated maximum gene length number?