Hi all!
Introducing the Non-Overlapping Exon Length calculator (NOEL), an extremely fast GTF/GFF per gene exon length extractor written in Rust. See the code and latest updates here: github/alejandrogzi/noel
In case you do not want to read the whole text: NOEL outperforms all open-sourced scripts/tools for this task. It can calculate non-overlapping exon lengths for ~62,000 genes in 3.9 seconds (GRCh38 GENCODE 44 GFF3) and 4 seconds (GRCh38 GENCODE 44 GTF). Additionally, just needs at most ~42 Mb of memory, making it accessible for anyone in any type of laptop/PC without having to restart sessions or have crashed runs. This can be easily attached to a nextflow/snakemake pipeline, other scripts (py, R, C++), run as binary, used as a Rust library, and more.
Why?
A week ago I needed to calculate non-overlapping exon lengths, googled some time and found some options (among strictly tools/software or scripts): Kooi 1, Sun 2, and Slowikowski 3,4 scripts, and gtftools (-l flag) 5. I found myself with some problems: missing genes, poor performance, high run times, excessive memory consumption, non CLI-responsive, etc. To maybe help other people with the same goal I had (just quickly calculate non-overlapping exon lengths from a GTF/GFF for any species, easily attached to a pipeline, etc), I develop NOEL (all the information below is part of the github's README):
noel
An extremely fast GTF/GFF per gene Non-Overlapping Exon Length calculator (noel) written in Rust.
Takes in a GTF/GFF file and outputs a .txt file with non-overlapping exon lengths.
Usage
Usage: noel[EXE] --i <GTF/GFF> --o <OUTPUT>
Arguments:
--i <GTF/GFF>: GTF/GFF file
--o <OUTPUT>: .txt file
Options:
--help: print help
--version: print version
crate: https://crates.io/crates/noel
Installation
to install noel on your system follow this steps:
- download rust:
curl https://sh.rustup.rs -sSf | sh
on unix, or go here for other options - run
cargo install noel
(make sure~/.cargo/bin
is in your$PATH
before running it) - use
noel
with the required arguments
Build
to build noel from this repo, do:
- get rust (as described above)
- run
git clone https://github.com/alejandrogzi/noel.git && cd noel
- run
cargo run --release <GTF/GFF> <OUTPUT>
(arguments are positional, so you do not need to specify --i/--o)
Benchmark
There are a handful amount of open-sourced tools/software/scripts to calculate non-overlapping exon lengths, namely: Kooi 1, Sun 2, and Slowikowski 3,4 scripts, and gtftools (-l flag) 5. The Non-Overlapping Exon Length calculator (NOEL; referred just as "noel"), is introduced as a novel tool that outperforms the aforementioned software due to its remarkable performance.
To assess the efficiency of noel and test the capabilities of other available scripts/tools, I used run times and memory usage estimates, based on 5 consecutive runs. This evaluation focused on two major gene annotation formats: GTF and GFF. It is worth nothing, however, that only 3 tools are capable of handling GFF files: Slowikowski, Sun* (described below) and noel. Before any batch of runs, I first modified each script to be CLI-responsive. Additionally, I further edited Sun's script to be able to handle GFF inputs by changing a regex pattern. No performance enhance-related changes or breaking structural modifications were applied.
Lastly, to evaluate the output consistency of the top-ranked tools (Sun, gtftools and noel), three species were used: Homo sapiens (GRCh38, GENCODE 44), Canis lupus familiaris (ROS_Cfam_1.0, Ensembl 110), and Mus musculus (GRCm39, GENCODE M33).
The diverse methodologies to calculate non-overlapping exon lengths led to noticeable differences in run times. While Kooi and Slowikowski scripts were the last ranked (>250s for GENCODE 44) with GTF files and Slowikowski only for GFF files (~300s for GENCODE 44); Sun, gtftools and noel were the most efficient options (<50s for GENCODE 44). When analyzing these top-ranked tools, it is quickly perceived the noel's dominance over its competitors. For GTF files, noel achieves noticeably faster computation times when compared to gtftools (x4.3 faster; 4.2s vs 17.9s) and to Sun's script (x10.9 speedup; 4.2s vs 45.7s). On the other hand, noel performs the calculations on GFF3 x12.6 times faster than Sun's script (3.9s vs 49.7s).
A similar pattern is seen when examining memory usage estimates based on GTF files. Three distinct groups of tools can be identified: high-memory-consuming tools (Sun, Slowikowski, and Kooi), tools with moderate memory usage (gtftools), and the most memory-efficient option (noel). Here, noel exhibited a significantly lower memory usage when compared to gtftools (x9.1 less; 42.9 Mb vs 391.8 Mb) and to Kooi (x73.1 less; 42.9 Mb vs 3.1 Gb). With GFF files, on the other hand, noel achieved a striking x146.1-fold reduction in memory usage compared to Slowikowski (62,700 genes).
The comparison of output from the top-ranked tools, including Sun, gtftools, and noel, yielded consistently paired estimates for each species, resulting in a high correlation (R = 0.99). Notably, both noel and Sun's script demonstrated a one-to-one correspondence for every gene in all tested annotation models. In contrast, gtftools exhibited limitations in processing genes, with a slight deficiency in the human and mouse models (0.05% and 0.06%, respectively), and a more substantial shortfall in the dog model (26%). Furthermore, noel outperformed the other tools, significantly improving runtime efficiency in both the mouse and dog models, with a speedup of at least 2.3 times.
Based on this comparative analysis between existing scripts/software to calculate non-overlapping exonic lengths and noel, it is evident that this tool represents a significant improvement. These findings unveil the potential of noel as a valuable resource to provide a fast and efficient way to automate non-overlapping exon length calculations.
Hope this helps someone!
Hi! Very nice work. Your code worked well for me for GTF files but silently failed (no output) on GFF3 files from ensembl. Did you test on these files? https://ftp.ensembl.org/pub/release-111/gff3/homo_sapiens/ It is possible there is a slight difference to GENCODE versions? Any help appreciated.