3.7 years ago
MutationalMeltdown
I generated a scRNA-seq object (counts, PCA, UMAP embeddings, DEGs etc.) in Scanpy or Seurat. What is the best data structure to store this in to reduce the size of the object?
I'm considering H5AD (scanpy/anndata), RDS or H5Seurat (Seurat), or Loom.
Fast loading and access would also be good, of course. Thanks.
Is this for archival purposes, or for access in other tools? gzip is also fast to decompress on the fly.
For access: imagine a database of objects. H5AD supports gzip compression, and I use it. The main issue is the counts matrix itself (even a sparse matrix tends to be far larger than the annotations). All the approaches I suggest (apart from RDS) use the HDF5 format, which seems pretty optimised, but I'm interested in the differences between them and in any alternatives.
I don't fully understand the problem and its requirements, but my gut feeling is to steer far away from RDS. Instead, I would save the sparse matrix with scipy/numpy (scipy.sparse.save_npz) and model the rest of the information as a relational database in SQLite.
I feel that would give the highest level of flexibility and extensibility for the future.
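A rough sketch of that layout, assuming `numpy` and `scipy` are available. The table names, column names and file names here are hypothetical, chosen only to illustrate the idea: the matrix goes into a compressed `.npz` file, and the annotations go into SQLite tables keyed by row and column index.

```python
import os
import sqlite3
import tempfile

import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n_cells, n_genes = 100, 300

# Toy sparse count matrix (hypothetical data).
counts = sparse.csr_matrix(
    rng.poisson(0.1, size=(n_cells, n_genes)).astype(np.int32)
)

with tempfile.TemporaryDirectory() as tmp:
    matrix_path = os.path.join(tmp, "counts.npz")
    db_path = os.path.join(tmp, "annotations.sqlite")

    # 1) The matrix itself: compressed sparse storage.
    sparse.save_npz(matrix_path, counts, compressed=True)

    # 2) Everything else: relational tables keyed by matrix row/column index.
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE cells (row_idx INTEGER PRIMARY KEY, barcode TEXT, "
        "cluster TEXT, umap1 REAL, umap2 REAL)"
    )
    con.execute("CREATE TABLE genes (col_idx INTEGER PRIMARY KEY, symbol TEXT)")
    con.executemany(
        "INSERT INTO cells VALUES (?, ?, ?, ?, ?)",
        [
            (i, f"cell{i}", str(i % 4), float(rng.normal()), float(rng.normal()))
            for i in range(n_cells)
        ],
    )
    con.executemany(
        "INSERT INTO genes VALUES (?, ?)",
        [(j, f"gene{j}") for j in range(n_genes)],
    )
    con.commit()

    # Query the annotations with SQL, then slice the matrix by the row indices.
    rows = [
        r[0]
        for r in con.execute(
            "SELECT row_idx FROM cells WHERE cluster = ? ORDER BY row_idx", ("2",)
        )
    ]
    loaded = sparse.load_npz(matrix_path)
    subset = loaded[rows, :]
    con.close()

print(subset.shape)
```

The trade-off is that the matrix and the annotations live in two files, so keeping `row_idx` and `col_idx` in sync with the matrix is the user's responsibility.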
Right, but that's similar to what anndata (the object written to H5AD) already is. Isn't that reinventing the wheel, or am I missing something? https://anndata.readthedocs.io/en/latest/
If it is H5AD then it can't be a relational database, right?
I honestly think that storing biological data in HDF5 format is a mistake. Relational databases are more elegant, robust, and simple to use from any language, or from no language at all.
The problem with relational databases is scalability: the dimensionality of this data is beyond their limits. MySQL can hold only hundreds of numerical columns, and Oracle's upper limit is 1k columns. We are building a specialized database called unified giant table to hold large-scale omics data.
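The column limit is easy to reproduce. The sketch below uses SQLite (not MySQL or Oracle, so the exact numbers differ) because it ships with Python; SQLite's documented default is 2000 columns per table and its compile-time ceiling is 32767. The table and column names are hypothetical. It also shows the usual "long" layout, one row per nonzero entry, which sidesteps the column limit at the cost of a very large row count.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Wide layout: one column per gene. 40,000 columns exceeds SQLite's
# compile-time ceiling (32767), so this fails on any build.
n_genes = 40_000
wide_cols = ", ".join(f"g{j} INTEGER" for j in range(n_genes))
try:
    con.execute(f"CREATE TABLE wide (cell_id INTEGER PRIMARY KEY, {wide_cols})")
    wide_ok = True
except sqlite3.Error as exc:
    wide_ok = False
    print("wide table rejected:", exc)

# Long layout: one row per nonzero (cell, gene) entry, like a COO sparse matrix.
con.execute(
    "CREATE TABLE counts (cell_id INTEGER, gene_id INTEGER, n_reads INTEGER, "
    "PRIMARY KEY (cell_id, gene_id))"
)
nonzero = [(0, 5, 3), (0, 39_999, 1), (1, 5, 7)]  # toy entries
con.executemany("INSERT INTO counts VALUES (?, ?, ?)", nonzero)

# Per-gene aggregation works regardless of how many genes exist.
total_gene5 = con.execute(
    "SELECT SUM(n_reads) FROM counts WHERE gene_id = ?", (5,)
).fetchone()[0]
con.close()
```

Whether the long layout scales to hundreds of millions of nonzero entries is a separate question, and is presumably where the scalability concern above comes in.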