Question

Seeking Public Nanopore Datasets (FAST5 + FASTA/FASTQ) for DNA Storage Research



9 months ago

Dev • 0

Hi everyone,

I’m working on a project related to DNA-based data storage and am looking for publicly available nanopore sequencing datasets to explore both supervised and unsupervised learning approaches for motif detection and decoding stored information.

For supervised learning, I’m looking for FAST5 files along with their associated FASTA or FASTQ sequences to train models that can learn motif representations directly from raw signal data. The type of dataset I’m interested in would be similar to what is described in this paper: https://www.nature.com/articles/s41598-023-43172-0#Sec4, where structured sequence elements in nanopore reads are analysed.
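To make the supervised setup concrete, here is a minimal sketch of how raw signals could be paired with their basecalled sequences. It assumes the FASTQ headers start with the same read ID that the raw-signal file uses for each read, which is the usual convention for nanopore output. The `signals` dictionary is a hypothetical stand-in for whatever is loaded from the FAST5 files, and the function names are my own, not part of any library.

```python
import gzip
from typing import Dict, Iterator, List, Sequence, Tuple


def read_fastq(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from a plain or gzipped 4-line FASTQ."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline()
            if not header:
                break  # end of file
            if not header.strip():
                continue  # tolerate stray blank lines
            if not header.startswith("@"):
                raise ValueError(f"Malformed FASTQ header: {header!r}")
            sequence = handle.readline().strip()
            handle.readline()  # '+' separator line, ignored
            quality = handle.readline().strip()
            # Assumption: the read ID is the first whitespace-delimited
            # token after '@'; anything after it is run metadata.
            read_id = header[1:].split()[0]
            yield read_id, sequence, quality


def index_by_read_id(path: str) -> Dict[str, str]:
    """Map read ID -> basecalled sequence."""
    return {read_id: seq for read_id, seq, _ in read_fastq(path)}


def pair_signals(
    signals: Dict[str, Sequence[int]],
    labels: Dict[str, str],
) -> List[Tuple[str, Sequence[int], str]]:
    """Keep only reads that have both a raw signal and a sequence label.

    `signals` is a hypothetical dict of read ID -> raw current samples,
    filled by whichever FAST5 reader is used.
    """
    return [
        (read_id, signal, labels[read_id])
        for read_id, signal in signals.items()
        if read_id in labels
    ]
```

Reads that appear in only one of the two inputs are dropped, so the result can be used directly as (signal, sequence) training pairs, while the unmatched signals remain usable for the unsupervised case.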

For unsupervised learning, I’m also interested in raw FAST5 files where sequence labels may not be available, but the dataset contains complex structural patterns that could be useful for learning intrinsic representations of nanopore signal data. Large-scale datasets with diverse sequencing conditions would be especially useful for this.

If anyone is aware of publicly available datasets that fit these criteria or has suggestions on where to find them, I’d greatly appreciate any recommendations!

Thanks in advance!

Dev

Nanopore-Sequencing Machine-Learning FAST5 DNA • 805 views

updated 9 months ago by Ram 45k • written 9 months ago by Dev • 0

score 2 · Answer 1 · 2025-01-30

You are less likely to find raw signal data in public databases than basecalled reads. FAST5 files have also been superseded by the POD5 format. The raw datasets can be several times the size of the final FASTQ files.

If you truly need the raw signal data, your best bet may be to find a group/lab that would be willing to share it directly.