Seeking Public Nanopore Datasets (FAST5 + FASTA/FASTQ) for DNA Storage Research
22 days ago
Dev • 0

Hi everyone,

I’m working on a project related to DNA-based data storage and am looking for publicly available nanopore sequencing datasets to explore both supervised and unsupervised learning approaches for motif detection and decoding stored information.

For supervised learning, I’m looking for FAST5 files along with their associated FASTA or FASTQ sequences to train models that can learn motif representations directly from raw signal data. The type of dataset I’m interested in would be similar to what is described in this paper: https://www.nature.com/articles/s41598-023-43172-0#Sec4, where structured sequence elements in nanopore reads are analysed.
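To make the supervised pairing concrete, here is a minimal sketch of how raw signals could be matched to their basecalled sequences by read ID. It uses only the Python standard library. The `signals` dictionary is a hypothetical stand-in for whatever a FAST5/POD5 reader would return, and the function names `parse_fastq` and `pair_signals_with_labels` are my own, not from any library. It also assumes a simple four-line-per-record FASTQ.

```python
import io
from typing import Dict, Iterator, List, Sequence, TextIO, Tuple


def parse_fastq(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (read_id, sequence) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        header = header.strip()
        if not header:          # tolerate stray blank lines between records
            continue
        if not header.startswith("@"):
            raise ValueError(f"Expected FASTQ header, got: {header!r}")
        sequence = handle.readline().strip()
        handle.readline()       # '+' separator line
        handle.readline()       # quality string, not needed for labels
        # The read ID is the first whitespace-delimited token after '@'.
        yield header[1:].split()[0], sequence


def pair_signals_with_labels(
    signals: Dict[str, Sequence[float]],
    fastq_handle: TextIO,
) -> List[Tuple[str, Sequence[float], str]]:
    """Keep only reads that have both a raw signal and a basecalled sequence.

    `signals` is a hypothetical read_id -> raw signal mapping; in practice it
    would be filled from FAST5/POD5 files with a dedicated reader.
    """
    labels = dict(parse_fastq(fastq_handle))
    return [
        (read_id, signals[read_id], labels[read_id])
        for read_id in signals
        if read_id in labels
    ]


# Toy usage with made-up values (not real nanopore data).
toy_fastq = io.StringIO("@read1 runid=abc\nACGT\n+\n!!!!\n@read2\nTTGA\n+\n!!!!\n")
toy_signals = {"read1": [512.0, 498.0, 530.0], "read3": [601.0, 577.0]}
for read_id, signal, sequence in pair_signals_with_labels(toy_signals, toy_fastq):
    print(read_id, len(signal), sequence)
```

Reads without a matching partner are dropped, so a dataset is only useful for this purpose if the read IDs in the raw files and the FASTQ files line up.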

For unsupervised learning, I’m also interested in raw FAST5 files where sequence labels may not be available, but the dataset contains complex structural patterns that could be useful for learning intrinsic representations of nanopore signal data. Large-scale datasets with diverse sequencing conditions would be especially useful for this.

If anyone is aware of publicly available datasets that fit these criteria or has suggestions on where to find them, I’d greatly appreciate any recommendations!

Thanks in advance!

Dev

Nanopore-Sequencing Machine-Learning FAST5 DNA • 233 views
22 days ago
GenoMax 149k

You are unlikely to find raw signal data in public databases. The FAST5 format has also been superseded by POD5. The raw datasets can be several times the size of the final FASTQ files.

If you truly need the raw (pre-basecalling) signal data, your best bet may be to find a group or lab that is willing to share it directly.
