The Biostar Herald publishes user-submitted links of bioinformatics relevance. It aims to provide a summary of interesting and relevant information you may have missed. You too can submit links here.
This edition of the Herald was brought to you by contributions from Istvan Albert and was edited by Istvan Albert.
Tiberius: End-to-End Deep Learning with an HMM for Gene Prediction | bioRxiv (www.biorxiv.org)
For more than 25 years, learning-based eukaryotic gene predictors were driven by hidden Markov models (HMMs), which took a DNA sequence directly as input.
We present Tiberius, a novel deep learning-based ab initio gene predictor that end-to-end integrates convolutional and long short-term memory layers with a differentiable HMM layer. Tiberius uses a custom gene prediction loss and was trained for prediction in mammalian genomes and evaluated on human and two other genomes. It significantly outperforms existing ab initio methods, achieving F1-scores of 62% at gene level for the human genome, compared to 21% for the next best ab initio method. In de novo mode, Tiberius predicts the exon-intron structure of two out of three human genes without error. Remarkably, even Tiberius’s ab initio accuracy matches that of BRAKER3, which uses RNA-seq data and a protein database. Tiberius’s highly parallelized model is the fastest state-of-the-art gene prediction method, processing the human genome in under 2 hours.
submitted by: Istvan Albert
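For readers unfamiliar with the metric behind the Tiberius numbers above, the F1-score is the harmonic mean of precision and recall. The sketch below is a minimal illustration, not the authors' evaluation code. It assumes that a predicted gene counts as a true positive only when it matches a reference gene, and the function name `f1_score` and the counts are hypothetical, chosen only so the result lands near the quoted 62%.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Return the harmonic mean of precision and recall from raw counts."""
    predicted = true_positives + false_positives  # all genes the tool predicted
    actual = true_positives + false_negatives     # all genes in the reference
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts (not from the paper): 620 correct predictions,
# 380 wrong predictions, 380 reference genes that were missed.
score = f1_score(620, 380, 380)
print(f"gene-level F1 = {score:.2f}")
```

Because F1 penalizes both missed genes and spurious predictions, a tool cannot raise it simply by predicting more genes.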
Upstream open reading frames may contain hundreds of novel human exons | PLOS Computational Biology (journals.plos.org)
We examined 2,199 upstream ORFs that have been proposed as high-quality candidates for novel genes, to determine if they could instead represent protein-coding exons that can be added to existing genes.
We determined that 541 out of 2,199 upstream ORFs have strong evidence that they can form protein coding exons that are part of an existing gene, and that the resulting protein is predicted to have similar or better structural quality than the currently annotated isoform.
submitted by: Istvan Albert
GitHub - HKU-BAL/Clair3-RNA: Clair3-RNA - a long-read small variant caller for RNA sequencing data (github.com)
- Clair3-RNA - A deep learning-based small variant caller for long-read RNA sequencing data
- For germline small variant calling, please use Clair3.
- For somatic small variant calling using a tumor-normal pair, please try ClairS.
- For somatic small variant calling using tumor sample only, please try ClairS-TO.
submitted by: Istvan Albert
x.com (x.com)
Looking forward to talking about the cancer microbiome (or lack thereof!) to @broadinstitute and @MIT tomorrow, with my colleague @AbrahamGihawi. I'll be virtual and Abe will be in Cambridge; we're co-authors but have never met in person!
submitted by: Istvan Albert
Navigating the pitfalls of mapping DNA and RNA modifications | Nature Reviews Genetics (www.nature.com)
Chemical modifications to nucleic acids occur across the kingdoms of life and carry important regulatory information. Reliable high-resolution mapping of these modifications is the foundation of functional and mechanistic studies, and recent methodological advances based on next-generation sequencing and long-read sequencing platforms are critical to achieving this aim. However, mapping technologies may have limitations that sometimes lead to inconsistent results. Some of these limitations are technical in nature and specific to certain types of technology. Here, however, we focus on common (yet not always widely recognized) pitfalls that are shared among frequently used mapping technologies and discuss strategies to help technology developers and users mitigate their effects.
submitted by: Istvan Albert
Revolutionizing genomics and medicine—one long molecule at a time (genome.cshlp.org)
A special issue on Long Read Sequencing
submitted by: Istvan Albert
[High-coverage nanopore sequencing of samples from the 1000 Genomes Project to build a comprehensive catalog of human genetic variation](https://genome.cshlp.org/content/34/11/2061.abstract?etoc) (genome.cshlp.org)
To address this, the 1000 Genomes Project (1KGP) Oxford Nanopore Technologies Sequencing Consortium aims to generate Long Read Sequencing (LRS) data from at least 800 of the 1KGP samples. [...] We identify an average of 24,543 high-confidence SVs per genome, including shared and private SVs likely to disrupt gene function as well as pathogenic expansions. All raw sequencing data, processed data, and summary statistics are publicly available, providing a valuable resource for the clinical genetics community to discover pathogenic SVs.
submitted by: Istvan Albert
Release v1.10 - Novel in & novel out · oschwengers/bakta · GitHub (github.com)
bakta: Rapid & standardized annotation of bacterial genomes, MAGs & plasmids
This is the tenth minor release (v1.10) introducing user-provided HMMs, output file recovery, feature inference scores, and various improvements.
An important decision had to be made for this release regarding supported Python versions, external dependencies, and the features they affect. Both Circos and DeepSig seem to have been out of support for a long time. Hence, Circos was replaced by pyCirclize, a pure-Python, actively maintained library, enabling a couple of new features. As a result, Bakta's Python dependency had to be bumped to >=3.9, thus unfortunately losing compatibility with DeepSig. Dropping an existing feature feels odd and wrong, but as a developer, sticking to unmaintained external software for too long constantly increases your daily pain level and slows down the project as a whole. This hasn't been an easy decision, but a necessary one. So, if you depend on the detection of signal peptides, please keep using Bakta <=v1.9.4.
submitted by: Istvan Albert
Want to get the Biostar Herald in your email? Who wouldn't? Sign up righ'ere: toggle subscription