Gene prediction in the era of long read sequencing data and many reference genomes
17 months ago
William ★ 5.3k

With the availability and affordability of long-read sequencing, it has become possible to create many reference genomes, either for many individuals of the same species or for many different species.

See e.g. how Hifiasm can be used with a single command to create a reference genome in a few hours or days for eukaryotic genomes. https://github.com/chhylp123/hifiasm

I am wondering what, now that long-read RNA-seq data is available, is an effective way to create good-enough gene models for these many reference genomes.

The gene models don't need to be perfect; they never will be.

But they should contain genes, transcripts, exons and CDS features, and be compatible with genome browsers and analysis tools such as AGAT and SnpEff.
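To make the "should contain genes, transcripts, exons, CDS" requirement concrete, here is a minimal sketch of a sanity check on a GFF3 annotation. It only counts the feature types in column 3 and reports which expected types are absent. The function names and the tiny example records are hypothetical, and accepting either `mRNA` or `transcript` as the transcript-level type is an assumption, since tools differ in which one they emit. It is not a substitute for a real validator such as the ones shipped with AGAT.

```python
from collections import Counter

# Feature types the question asks for. Treating "mRNA" and "transcript" as
# interchangeable is an assumption; annotation tools differ on which they use.
REQUIRED_TYPES = {"gene", "exon", "CDS"}
TRANSCRIPT_TYPES = {"mRNA", "transcript"}


def count_feature_types(gff3_lines):
    """Count the feature type (column 3) of every 9-column GFF3 record."""
    counts = Counter()
    for line in gff3_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # not a feature record (e.g. embedded FASTA)
        counts[columns[2]] += 1
    return counts


def missing_feature_types(counts):
    """Return the expected feature types that have no records at all."""
    missing = sorted(t for t in REQUIRED_TYPES if counts[t] == 0)
    if not any(counts[t] for t in TRANSCRIPT_TYPES):
        missing.append("mRNA/transcript")
    return missing


# Hypothetical single-gene annotation, for illustration only.
example = [
    "##gff-version 3",
    "chr1\texample\tgene\t100\t900\t.\t+\t.\tID=g1",
    "chr1\texample\tmRNA\t100\t900\t.\t+\t.\tID=g1.t1;Parent=g1",
    "chr1\texample\texon\t100\t900\t.\t+\t.\tID=g1.t1.e1;Parent=g1.t1",
    "chr1\texample\tCDS\t150\t800\t.\t+\t0\tID=g1.t1.c1;Parent=g1.t1",
]

counts = count_feature_types(example)
print(dict(counts))
print("missing:", missing_feature_types(counts))
```

For a real file, the same function can be fed an open file handle, e.g. `count_feature_types(open("annotation.gff3"))`, where the file name is a placeholder.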

Gene model prediction tools from the time when reference genomes took years to make are quite good. But they take weeks to months to run, require many different types of input data, and involve many separate commands.

So, given the many reference genomes and the long-read RNA-seq data mentioned above, which gene prediction tools are currently good enough?

gene-prediction • 610 views
17 months ago
William ★ 5.3k

BRAKER3 looks interesting.

By chance, its preprint came online today (the same day I posted this question).

"New eukaryotic genomes are being sequenced at increasing rates. However, the pace of genome annotation, which establishes links between genomic sequence and biological function, is lagging behind. For example, in April 2023 49% of the eukaryotic species with assemblies in GenBank, had no annotation in GenBank. Undertakings such as the Earth BioGenome Project (https://www.earthbiogenome.org), which aims to annotate c.a. 1.5 million eukaryotic species, further require that the annotation pipeline is highly automated and reliable and ideally no manual work for each species is required when genome assembly and RNA-Seq are given"

BRAKER3 is the latest pipeline in the BRAKER suite. It enables the usage of RNA-seq and protein data in a fully automated pipeline to train and predict highly reliable genes with GeneMark-ETP and AUGUSTUS.

"Here we present BRAKER3, a novel genome annotation pipeline for eukaryotic genomes that integrates evidence from transcript reads, homologous proteins and the genome itself. We report significantly improved accuracy for 11 test species. BRAKER3 outperforms its predecessors BRAKER1 and BRAKER2 by a large margin, as well as publicly available pipelines, such as MAKER2, FINDER and Funannotate. The most substantial improvements are observed in species with large and complex genomes. Additionally, BRAKER3 adds a Singularity container to the BRAKER suite, which makes it more user-friendly and easier to install."

https://www.biorxiv.org/content/10.1101/2023.06.10.544449v1

https://github.com/Gaius-Augustus/BRAKER
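As a rough idea of what "RNA-seq and protein data in a fully automated pipeline" looks like in practice, here is a minimal sketch that assembles a BRAKER command line with a genome, an RNA-seq alignment and a protein set. The `--genome`, `--bam` and `--prot_seq` options are taken from the BRAKER README as I understand it; the helper function and all file names are hypothetical. Check the options against the README for the installed version, in particular for how long-read data should be supplied, before running anything.

```python
import shlex


def braker_command(genome_fasta, rnaseq_bam, protein_fasta):
    """Build an argument list for braker.pl (nothing is executed here).

    The three options below are the ones described in the BRAKER README;
    verify them against the version you have installed.
    """
    return [
        "braker.pl",
        f"--genome={genome_fasta}",
        f"--bam={rnaseq_bam}",
        f"--prot_seq={protein_fasta}",
    ]


# Placeholder file names, for illustration only.
cmd = braker_command("genome.fa", "rnaseq.bam", "proteins.fa")
print(shlex.join(cmd))
```

Building the command as a list keeps it easy to hand to `subprocess.run(cmd, check=True)` once per genome when looping over many assemblies.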
