Hi everyone,
I’m working on a project on DNA-based data storage and am looking for publicly available nanopore sequencing datasets to explore both supervised and unsupervised learning approaches for detecting motifs and decoding stored information.
For supervised learning, I’m looking for FAST5 files along with their associated FASTA or FASTQ sequences to train models that can learn motif representations directly from raw signal data. Ideally the dataset would be similar to the one described in this paper, which analyses structured sequence elements in nanopore reads: https://www.nature.com/articles/s41598-023-43172-0#Sec4
For unsupervised learning, I’m also interested in raw FAST5 files that may not come with sequence labels, as long as the reads contain complex structural patterns that could be useful for learning intrinsic representations of nanopore signal data. Large-scale datasets covering diverse sequencing conditions would be especially useful for this.
If anyone is aware of publicly available datasets that fit these criteria, or has suggestions on where to find them, I’d greatly appreciate any recommendations!
Thanks in advance!
Dev