Hello, I am working with large files (e.g. 10^7 lines) in a custom tab-delimited format (e.g. geneid TAB SNPid TAB SNPcoord TAB measure_x TAB measure_y TAB ...). I have two requirements:
- compress the files to save disk space;
- quickly access selected lines, in an order different from the one in which they appear in the file.
I am coding in C++. Up to now, I have been using Gzstream to read gzipped files easily (I am too much of a beginner to use Zlib directly), but it doesn't allow random access, so I am still working with uncompressed files. Typically, I first go through the whole file and record the stream position of each line I am interested in (using tellg). Then I jump back to these lines using the recorded stream positions (seekg followed by getline). I would like to be able to do the same on compressed files.
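For reference, the two-pass approach on uncompressed files can be sketched as below. This is only a minimal illustration of the tellg/seekg/getline pattern described above; the function names index_lines and fetch_line are made up for the example, and the file is assumed to be opened in binary mode so that stream positions are plain byte offsets.

```cpp
#include <fstream>
#include <string>
#include <vector>

// First pass: record the stream position at which every line starts.
// (Hypothetical helper; in practice one would keep only the lines of interest.)
std::vector<std::streampos> index_lines(std::ifstream& in) {
    std::vector<std::streampos> offsets;
    std::string line;
    in.clear();
    in.seekg(0);
    while (true) {
        std::streampos pos = in.tellg();      // position before reading the line
        if (!std::getline(in, line)) break;   // stop at end of file
        offsets.push_back(pos);
    }
    in.clear();  // reset eof/fail bits so that later seeks work
    return offsets;
}

// Second pass: jump to a recorded position and read that single line.
std::string fetch_line(std::ifstream& in, std::streampos pos) {
    std::string line;
    in.clear();
    in.seekg(pos);
    std::getline(in, line);
    return line;
}
```

With an std::ifstream opened as `std::ifstream in(path, std::ios::binary)`, calling `index_lines(in)` once and then `fetch_line(in, offsets[i])` for any i returns line i without rereading the file.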
From what I have read (e.g. mentioned here), BGZF allows exactly that, so in theory I could use it in my code. Has anyone tried? As I am more of a geneticist than a programmer, is there a code snippet somewhere that I could reuse?
Otherwise, I could design my own minimal binary format. It would be quick to implement, but it would be ad hoc. Should I rather look into the HDF5 format (e.g. here)? With h5dump, it seems possible to access only a subset of the data. But has anyone tried calling the library functions directly from their C/C++ code?
When accessing the file, if you can order your seeks so that they follow the order of the file, they will be much faster. If that's not possible, putting the data on an SSD, which is very quick for seeks, can speed things up.
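A rough sketch of the first suggestion, under the assumption that the wanted lines are identified by byte offsets recorded earlier: sort the requests by offset, read them in file order, and store each line back at the index where it was requested. The name fetch_in_file_order is invented for the example, and the offsets are held as std::streamoff so that they can be compared with `<`.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <numeric>
#include <string>
#include <vector>

// Read the lines starting at the given byte offsets. Seeks are issued in
// increasing offset order, but results are returned in the requested order.
std::vector<std::string> fetch_in_file_order(
        std::ifstream& in, const std::vector<std::streamoff>& wanted) {
    // order[k] is the index into 'wanted' of the k-th smallest offset.
    std::vector<std::size_t> order(wanted.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(),
              [&wanted](std::size_t a, std::size_t b) {
                  return wanted[a] < wanted[b];
              });

    std::vector<std::string> lines(wanted.size());
    for (std::size_t i : order) {
        in.clear();                            // drop any eof bit from a previous read
        in.seekg(std::streampos(wanted[i]));   // forward-only seeks
        std::getline(in, lines[i]);            // keep the caller's ordering
    }
    return lines;
}
```

Whether this helps in practice depends on the storage device and on how far apart the requested lines are; it is meant only to show how the reordering can be done without losing the original request order.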