Tool: NOEL: An extremely fast Non-Overlapping Exon Length calculator written in Rust
12 months ago
alejandrogzi ▴ 140

Hi all!

Introducing the Non-Overlapping Exon Length calculator (NOEL), an extremely fast GTF/GFF per-gene exon length extractor written in Rust. See the code and latest updates here: https://github.com/alejandrogzi/noel

In case you do not want to read the whole text: NOEL outperforms the open-source scripts/tools I benchmarked for this task. It can calculate non-overlapping exon lengths for ~62,000 genes in 3.9 seconds (GRCh38 GENCODE 44 GFF3) and 4 seconds (GRCh38 GENCODE 44 GTF). Additionally, it needs at most ~42 MB of memory, so it runs on any type of laptop/PC without crashed runs or restarted sessions. It can easily be attached to a nextflow/snakemake pipeline or to other scripts (py, R, C++), run as a binary, used as a Rust library, and more.

Why?

A week ago I needed to calculate non-overlapping exon lengths, so I googled for a while and found some options (strictly tools/software or scripts): the Kooi 1, Sun 2, and Slowikowski 3,4 scripts, and gtftools (-l flag) 5. I ran into several problems: missing genes, poor performance, long run times, excessive memory consumption, no command-line interface, etc. To help other people with the same goal I had (quickly calculate non-overlapping exon lengths from a GTF/GFF for any species, in a way that is easy to attach to a pipeline), I developed NOEL (all the information below is part of the GitHub README):

noel

An extremely fast GTF/GFF per-gene Non-Overlapping Exon Length calculator (noel) written in Rust.


Takes in a GTF/GFF file and outputs a .txt file with non-overlapping exon lengths.
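For anyone unfamiliar with the metric, here is a minimal sketch of what "non-overlapping exon length" means: collect the exon intervals of each gene, count every base covered by at least one exon exactly once, and report the total. This is only an illustration and not noel's actual source. The function names (gene_id, merged_length, exon_lengths) are hypothetical, and the sketch assumes 1-based inclusive coordinates and a gene_id attribute on every exon line, written either GTF style (gene_id "X";) or GFF3 style (gene_id=X;).

```rust
use std::collections::HashMap;

/// Extract the gene identifier from the attribute column (column 9).
/// Accepts GTF style (`gene_id "X";`) and GFF3 style (`gene_id=X;`).
fn gene_id(attributes: &str) -> Option<String> {
    for field in attributes.split(';') {
        let field = field.trim();
        if let Some(rest) = field.strip_prefix("gene_id") {
            // The key must be followed by '=' (GFF3) or a space (GTF).
            if !rest.starts_with(|c: char| c == '=' || c == ' ') {
                continue;
            }
            let value = rest[1..].trim().trim_matches('"');
            if !value.is_empty() {
                return Some(value.to_string());
            }
        }
    }
    None
}

/// Number of bases covered by at least one interval.
/// Coordinates are assumed to be 1-based and inclusive.
fn merged_length(mut exons: Vec<(u64, u64)>) -> u64 {
    exons.sort_unstable();
    let mut total = 0u64;
    let mut covered_to = 0u64; // highest coordinate already counted
    for (start, end) in exons {
        if end <= covered_to {
            continue; // fully inside what was already counted
        }
        let from = start.max(covered_to + 1);
        total += end - from + 1;
        covered_to = end;
    }
    total
}

/// Per-gene non-overlapping exon length from annotation text.
fn exon_lengths(annotation: &str) -> HashMap<String, u64> {
    let mut per_gene: HashMap<String, Vec<(u64, u64)>> = HashMap::new();
    for line in annotation.lines() {
        if line.starts_with('#') {
            continue; // header or comment line
        }
        let cols: Vec<&str> = line.split('\t').collect();
        if cols.len() < 9 || cols[2] != "exon" {
            continue; // keep exon features only
        }
        if let (Ok(start), Ok(end), Some(gene)) = (
            cols[3].parse::<u64>(),
            cols[4].parse::<u64>(),
            gene_id(cols[8]),
        ) {
            per_gene.entry(gene).or_default().push((start, end));
        }
    }
    per_gene
        .into_iter()
        .map(|(gene, exons)| (gene, merged_length(exons)))
        .collect()
}
```

Passing the contents of an annotation file to exon_lengths returns a map from gene ID to length. Exon lines without a gene_id attribute are silently skipped by this sketch, so a file that links exons to genes only through a parent attribute would need an extra lookup step.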

Usage

Usage: noel[EXE] --i <GTF/GFF> --o <OUTPUT>

Arguments:
    --i <GTF/GFF>: GTF/GFF file
    --o <OUTPUT>: .txt file

Options:
    --help: print help
    --version: print version
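If you want to call the binary from another Rust program, a minimal sketch using std::process::Command could look like the one below. It only relies on the --i and --o flags listed above and assumes noel is on your $PATH. The helper names noel_command and run_noel are hypothetical and are not part of the noel crate.

```rust
use std::process::Command;

/// Build (but do not run) a `noel --i <input> --o <output>` invocation.
/// Assumes the `noel` binary is on $PATH.
fn noel_command(input: &str, output: &str) -> Command {
    let mut cmd = Command::new("noel");
    cmd.args(["--i", input, "--o", output]);
    cmd
}

/// Run noel and report whether it exited with a success status.
fn run_noel(input: &str, output: &str) -> std::io::Result<bool> {
    Ok(noel_command(input, output).status()?.success())
}
```

run_noel returns an I/O error if the binary cannot be started and Ok(false) if it starts but exits with a non-zero status.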

crate: https://crates.io/crates/noel

Installation

To install noel on your system, follow these steps:

  1. download rust: curl https://sh.rustup.rs -sSf | sh on unix, or go here for other options
  2. run cargo install noel (make sure ~/.cargo/bin is in your $PATH before running it)
  3. use noel with the required arguments

Build

To build noel from this repo:

  1. get rust (as described above)
  2. run git clone https://github.com/alejandrogzi/noel.git && cd noel
  3. run cargo run --release <GTF/GFF> <OUTPUT> (arguments are positional, so you do not need to specify --i/--o)

Benchmark

There are a handful of open-source tools/software/scripts to calculate non-overlapping exon lengths, namely the Kooi 1, Sun 2, and Slowikowski 3,4 scripts, and gtftools (-l flag) 5. The Non-Overlapping Exon Length calculator (NOEL; referred to simply as "noel") is introduced as a new tool that outperforms the aforementioned software in both run time and memory usage.

To assess the efficiency of noel and test the capabilities of other available scripts/tools, I used run time and memory usage estimates based on 5 consecutive runs. This evaluation focused on two major gene annotation formats: GTF and GFF. It is worth noting, however, that only 3 tools are capable of handling GFF files: Slowikowski, Sun* (described below) and noel. Before any batch of runs, I first modified each script to accept command-line arguments. Additionally, I edited Sun's script to handle GFF inputs by changing a regex pattern. No performance-related changes or breaking structural modifications were applied.
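The post does not say which tool was used to take the measurements, so the following is only a sketch of how run times over consecutive runs could be collected and averaged with the Rust standard library. time_runs and mean_seconds are hypothetical helpers, and memory usage is not covered here.

```rust
use std::time::{Duration, Instant};

/// Time `runs` consecutive executions of `task` (wall-clock time).
fn time_runs<F: FnMut()>(runs: usize, mut task: F) -> Vec<Duration> {
    (0..runs)
        .map(|_| {
            let started = Instant::now();
            task();
            started.elapsed()
        })
        .collect()
}

/// Mean of the samples in seconds; 0.0 for an empty slice.
fn mean_seconds(samples: &[Duration]) -> f64 {
    if samples.is_empty() {
        return 0.0;
    }
    samples.iter().map(Duration::as_secs_f64).sum::<f64>() / samples.len() as f64
}
```

To benchmark a command-line tool this way, the closure passed to time_runs would spawn the tool with std::process::Command and wait for it to finish.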

Lastly, to evaluate the output consistency of the top-ranked tools (Sun, gtftools and noel), three species were used: Homo sapiens (GRCh38, GENCODE 44), Canis lupus familiaris (ROS_Cfam_1.0, Ensembl 110), and Mus musculus (GRCm39, GENCODE M33).


The different approaches to calculating non-overlapping exon lengths led to noticeable differences in run times. With GTF files, the Kooi and Slowikowski scripts ranked last (>250s for GENCODE 44), as did Slowikowski with GFF files (~300s for GENCODE 44); Sun, gtftools and noel were the most efficient options (<50s for GENCODE 44). Among these top-ranked tools, noel is clearly the fastest. For GTF files, noel is 4.3x faster than gtftools (4.2s vs 17.9s) and 10.9x faster than Sun's script (4.2s vs 45.7s). For GFF3 files, noel is 12.6x faster than Sun's script (3.9s vs 49.7s).


A similar pattern is seen when examining memory usage estimates based on GTF files. Three distinct groups of tools can be identified: high-memory tools (Sun, Slowikowski, and Kooi), tools with moderate memory usage (gtftools), and the most memory-efficient option (noel). Here, noel used far less memory than gtftools (9.1x less; 42.9 MB vs 391.8 MB) and Kooi (73.1x less; 42.9 MB vs 3.1 GB). With GFF files, noel achieved a 146.1-fold reduction in memory usage compared to Slowikowski (62,700 genes).


Comparing the output of the top-ranked tools (Sun, gtftools, and noel) yielded consistent per-gene estimates for each species, with a high correlation (R = 0.99). Notably, noel and Sun's script showed a one-to-one correspondence for every gene in all tested annotation models. In contrast, gtftools did not report every gene: it missed a small fraction in the human and mouse models (0.05% and 0.06%, respectively) and a much larger share in the dog model (26%). Furthermore, noel was also the fastest tool in the mouse and dog models, with a speedup of at least 2.3x.
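A consistency check like this one can be sketched as follows, assuming each tool's output has already been loaded into a map from gene ID to length. The post does not state which correlation coefficient was used, so Pearson's R is an assumption here, and pearson and missing_fraction are hypothetical helper names.

```rust
use std::collections::HashMap;

/// Pearson correlation of lengths over the genes present in both outputs.
/// Returns None with fewer than two shared genes or zero variance.
fn pearson(a: &HashMap<String, u64>, b: &HashMap<String, u64>) -> Option<f64> {
    let pairs: Vec<(f64, f64)> = a
        .iter()
        .filter_map(|(gene, &x)| b.get(gene).map(|&y| (x as f64, y as f64)))
        .collect();
    if pairs.len() < 2 {
        return None;
    }
    let n = pairs.len() as f64;
    let mean_x = pairs.iter().map(|p| p.0).sum::<f64>() / n;
    let mean_y = pairs.iter().map(|p| p.1).sum::<f64>() / n;
    let (mut sxy, mut sxx, mut syy) = (0.0_f64, 0.0_f64, 0.0_f64);
    for (x, y) in &pairs {
        let (dx, dy) = (x - mean_x, y - mean_y);
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    if sxx == 0.0 || syy == 0.0 {
        return None;
    }
    Some(sxy / (sxx * syy).sqrt())
}

/// Fraction of genes in `reference` that are absent from `other`.
fn missing_fraction(reference: &HashMap<String, u64>, other: &HashMap<String, u64>) -> f64 {
    if reference.is_empty() {
        return 0.0;
    }
    let missing = reference
        .keys()
        .filter(|gene| !other.contains_key(*gene))
        .count();
    missing as f64 / reference.len() as f64
}
```

With one tool's output as the reference, missing_fraction gives the share of genes that the other tool failed to report, and pearson measures agreement on the genes both tools did report.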

Based on this comparison between noel and the existing scripts/software for calculating non-overlapping exon lengths, noel represents a significant improvement in both speed and memory usage. These results show its potential as a fast and efficient way to automate non-overlapping exon length calculations.

Hope this helps someone!

gene-annotation exon-length • 681 views

Hi! Very nice work. Your code worked well for me on GTF files but silently failed (no output) on GFF3 files from Ensembl. Did you test on these files? https://ftp.ensembl.org/pub/release-111/gff3/homo_sapiens/ Is it possible that they differ slightly from the GENCODE versions? Any help appreciated.
