I need to store very large amounts of data associated with a project. The data is tabular: each table has 3 or 4 string-based annotation columns and up to 3 or 4 data columns, think the output of featureCounts or salmon. Each flat file has maybe several million rows, there is one file per sample per analysis, and there are ~500 samples and maybe 10 analyses. There are a few pure annotation tables as well. This all adds up to maybe 1TB of uncompressed data per set, and I have (or will have) perhaps 10 sets. Much of the data is very compressible; featureCounts tables, for example, compress to perhaps 1/10 of their original size with gzip.
I want a way of storing this data that:
- Makes it easily accessible: for example, pulling out all rows that meet a certain criterion in at least a minimum number of samples.
- Ideally I'd like to be able to do joins and merges.
- Sometimes I'll want the data as a matrix, sometimes in tidy format.
- Ideally it should be compressed.
- It should be possible to manipulate it in a disk-backed manner (i.e. without loading the whole table into memory to analyze it).
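To make the first and third bullets concrete, here's a toy pandas sketch (all table contents and column names invented) of the kind of filter and tidy/matrix flip I mean:

```python
import pandas as pd

# Hypothetical miniature of one analysis: tidy counts, one row per
# (gene, sample) pair. Names are made up for illustration.
tidy = pd.DataFrame({
    "gene":   ["g1", "g1", "g1", "g2", "g2", "g2", "g3", "g3", "g3"],
    "sample": ["s1", "s2", "s3"] * 3,
    "count":  [0, 5, 7, 0, 0, 1, 9, 9, 9],
})

# Tidy -> matrix (genes x samples), the other shape I sometimes need.
mat = tidy.pivot(index="gene", columns="sample", values="count")

# "Rows meeting a criterion in a minimum number of samples":
# e.g. count >= 5 in at least 2 samples.
keep = (mat >= 5).sum(axis=1) >= 2
filtered = mat[keep]
```

The real tables are of course millions of rows across hundreds of files, which is why doing this in memory is the sticking point.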
Up until now I've been using SQLite databases. But:
a) They seem really slow. If you don't need to do any joins/merges, it's much faster to filter a flat file than to run a query with a WHERE clause. Joining is also way quicker on data frames (pandas or R) than in the database, but then running out of memory is a real problem.
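For clarity, the two access paths I'm comparing look like this (toy data, invented schema; at real scale the data-frame route is the one I find much faster, once the file is parsed):

```python
import sqlite3
import pandas as pd

# Toy table standing in for one sample's counts (schema invented).
df = pd.DataFrame({"gene": ["g1", "g2", "g3"], "reads": [10, 0, 7]})

# Path 1: SQLite with a WHERE clause.
con = sqlite3.connect(":memory:")
df.to_sql("counts", con, index=False)
via_sql = pd.read_sql("SELECT gene FROM counts WHERE reads >= 5", con)
con.close()

# Path 2: filtering the data frame directly.
via_df = df.loc[df["reads"] >= 5, "gene"]
```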
b) They are not compressed, which means we are spending a lot on expensive storage for data that is easily compressible.
I've thought of a more full-fat database solution, like MySQL, but I don't really have the facilities to run a server. I've also thought about things like HDF5 (e.g. AnnData), but I don't have much experience with it. Or perhaps I'm just best leaving it as flat files and writing scripts to do the common parsing.
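For the flat-file route, the kind of script I'm imagining streams gzipped tables in bounded-memory chunks, something like this (file layout and column names invented for illustration):

```python
import os
import tempfile
import pandas as pd

# Write a tiny gzipped TSV standing in for one sample's counts file.
path = os.path.join(tempfile.mkdtemp(), "sample1.counts.tsv.gz")
pd.DataFrame({"gene": ["g1", "g2", "g3"],
              "reads": [10, 0, 7]}).to_csv(path, sep="\t", index=False)

# pandas decompresses .gz transparently; chunksize keeps memory bounded,
# so the whole table never has to be in RAM at once.
hits = []
for chunk in pd.read_csv(path, sep="\t", chunksize=100_000):
    hits.append(chunk[chunk["reads"] >= 5])
result = pd.concat(hits)
```

That gets me compression and disk-backed filtering, but joins/merges across files would still be on me to script.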
Does anybody have experience of this sort of thing they'd like to share?