I need to store very large amounts of data associated with a project. The data is tabular: each table has 3 or 4 string-based annotation columns and up to 3 or 4 data columns, think the output of featureCounts or salmon. Each flat file has maybe several million rows, there is one file per sample per analysis, and there are ~500 samples and maybe 10 analyses. There are a few pure annotation tables as well. This all adds up to maybe 1TB of uncompressed data per set, and I have (or will have) perhaps 10 sets. Much of the data is very compressible; featureCounts tables, for example, compress to perhaps 1/10 of their original size with gzip.
I want a way of storing this data that:
- Makes it easily accessible: for example, pulling out all rows that meet a certain criterion in at least a minimum number of samples.
- Ideally I'd like to be able to do joins and merges.
- Sometimes I'll want the data as a matrix, sometimes in tidy format.
- Ideally it should be compressed.
- It should be possible to manipulate it in a disk-backed manner (i.e. without loading the whole table into memory to analyze it).
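To make the first and third bullets concrete, here's a toy pandas sketch (all table contents and column names invented) of the kind of filter and tidy/matrix flip I mean:

```python
import pandas as pd

# Hypothetical miniature of one analysis: tidy counts, one row per
# (gene, sample) pair. Names are made up for illustration.
tidy = pd.DataFrame({
    "gene":   ["g1", "g1", "g1", "g2", "g2", "g2", "g3", "g3", "g3"],
    "sample": ["s1", "s2", "s3"] * 3,
    "count":  [0, 5, 7, 0, 0, 1, 9, 9, 9],
})

# Tidy -> matrix (genes x samples), the other shape I sometimes need.
mat = tidy.pivot(index="gene", columns="sample", values="count")

# "Rows meeting a criterion in a minimum number of samples":
# e.g. count >= 5 in at least 2 samples.
keep = (mat >= 5).sum(axis=1) >= 2
filtered = mat[keep]
```

The real tables are of course millions of rows across hundreds of files, which is why doing this in memory is the sticking point.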
Up until now I've been using SQLite databases. But:
a) They seem really slow. If you don't need to do any joins/merges, it's much faster to filter a flat file than to run a query with a WHERE clause. Joining is also way quicker on data frames (pandas or R) than in the database, but then running out of memory is a real problem.
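For clarity, the two access paths I'm comparing look like this (toy data, invented schema; at real scale the data-frame route is the one I find much faster, once the file is parsed):

```python
import sqlite3
import pandas as pd

# Toy table standing in for one sample's counts (schema invented).
df = pd.DataFrame({"gene": ["g1", "g2", "g3"], "reads": [10, 0, 7]})

# Path 1: SQLite with a WHERE clause.
con = sqlite3.connect(":memory:")
df.to_sql("counts", con, index=False)
via_sql = pd.read_sql("SELECT gene FROM counts WHERE reads >= 5", con)
con.close()

# Path 2: filtering the data frame directly.
via_df = df.loc[df["reads"] >= 5, "gene"]
```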
b) They are not compressed, which means we are spending a lot on expensive storage for data that is easily compressible.
I've thought of a more full-fat database solution, like MySQL, but I don't really have the facilities to run a server. I've also thought about things like HDF5 (e.g. AnnData), but I don't have much experience with it. Or perhaps I'm just best leaving it as flat files and writing scripts to do the common parsing.
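For the flat-file route, the kind of script I'm imagining streams gzipped tables in bounded-memory chunks, something like this (file layout and column names invented for illustration):

```python
import os
import tempfile
import pandas as pd

# Write a tiny gzipped TSV standing in for one sample's counts file.
path = os.path.join(tempfile.mkdtemp(), "sample1.counts.tsv.gz")
pd.DataFrame({"gene": ["g1", "g2", "g3"],
              "reads": [10, 0, 7]}).to_csv(path, sep="\t", index=False)

# pandas decompresses .gz transparently; chunksize keeps memory bounded,
# so the whole table never has to be in RAM at once.
hits = []
for chunk in pd.read_csv(path, sep="\t", chunksize=100_000):
    hits.append(chunk[chunk["reads"] >= 5])
result = pd.concat(hits)
```

That gets me compression and disk-backed filtering, but joins/merges across files would still be on me to script.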
Does anybody have experience of this sort of thing they'd like to share?