If I were doing this to an uncompressed fastq file, I'd have a pretty easy task: read in fastq records, one by one, modify the header slightly, output them one by one to a new file.
These fastq files aren't sorted so the output's records need not be in the same order.
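To make the uncompressed case concrete, here is a minimal sketch in plain C. `retag_fastq` and its `tag` argument are hypothetical names made up for illustration, and the code assumes strict 4-line records (no wrapped sequence lines) with every line shorter than the buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy FASTQ records from `in` to `out`, appending
 * `tag` to each header line. Assumes strict 4-line records and lines
 * shorter than the buffer. Returns the number of records written, or -1
 * on a truncated or malformed record. */
long retag_fastq(FILE *in, FILE *out, const char *tag)
{
    char line[4][65536];
    long n = 0;

    assert(in && out && tag);
    for (;;) {
        /* Read one record: header, sequence, '+' line, qualities. */
        int got = 0;
        while (got < 4 && fgets(line[got], sizeof line[got], in))
            got++;
        if (got == 0)
            break;                      /* clean end of file */
        if (got < 4 || line[0][0] != '@' || line[2][0] != '+')
            return -1;                  /* truncated or malformed record */

        /* Strip the header's line ending, then write header + tag. */
        line[0][strcspn(line[0], "\r\n")] = '\0';
        fprintf(out, "%s%s\n", line[0], tag);
        fputs(line[1], out);
        fputs(line[2], out);
        fputs(line[3], out);
        n++;
    }
    return n;
}
```

Only one record is held in memory at a time, so memory use does not grow with file size.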
I'm wondering if anyone has an example of this sort of use-case with htslib. If not, I'm also just wondering on a basic level what strategy I should employ to properly read and write my compressed fastq files as fast as possible:
1) Use htslib's thread pool which is designed to compress/decompress bgzf blocks... which I think are independent of fastq record boundaries. This means I'd probably need the whole file in memory before looping through records. I think.
2) Use my own thread routines and leverage the .fai and .gzi files for random access into the compressed fastq, with each thread assigned a more-or-less equal-sized slice of the file: decompressing, reading, transforming, then waiting for a mutex on a write thread to unlock and then writing out uncompressed data to a file, which I compress with bgzip later.
Any advice on 1 vs 2? Does anyone have an htslib "cookbook" somewhere with a bunch of different recipes? I'd be thrilled if that exists. I'm having a really tough time trying to figure out how to use htslib. My C/C++ isn't nearly as strong as that of the library's authors.
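The threading half of option 2 could look roughly like the sketch below. It is plain C with pthreads, not htslib: the .fai/.gzi lookup and the decompression are left out, and the records are assumed to be already in memory as 4-line strings. All names (`slice_job`, `retag_slice`, `retag_parallel`) are hypothetical, and the mutex guards a shared output stream rather than a dedicated write thread.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical job description for one worker thread. */
typedef struct {
    const char **recs;     /* shared, read-only input records */
    size_t begin, end;     /* this worker's slice: [begin, end) */
    const char *tag;       /* text appended to each header */
    FILE *out;             /* shared output stream */
    pthread_mutex_t *lock; /* guards `out` */
} slice_job;

static void *retag_slice(void *arg)
{
    slice_job *job = arg;
    char buf[4096];

    for (size_t i = job->begin; i < job->end; i++) {
        const char *rec = job->recs[i];
        size_t hdr = strcspn(rec, "\n");  /* header length, no newline */

        /* Build the record locally so the lock is held only for the write. */
        int len = snprintf(buf, sizeof buf, "%.*s%s%s",
                           (int)hdr, rec, job->tag, rec + hdr);
        if (len < 0 || (size_t)len >= sizeof buf)
            continue;                     /* sketch only: skip oversized records */

        pthread_mutex_lock(job->lock);
        fwrite(buf, 1, (size_t)len, job->out);
        pthread_mutex_unlock(job->lock);
    }
    return NULL;
}

/* Split n records across nthreads workers. Output order is not preserved.
 * Returns 0 on success, -1 if a thread could not be started. */
int retag_parallel(const char **recs, size_t n, const char *tag,
                   FILE *out, int nthreads)
{
    enum { MAX_THREADS = 64 };
    pthread_t tid[MAX_THREADS];
    slice_job job[MAX_THREADS];
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    int started = 0;

    assert(recs && tag && out && nthreads > 0 && nthreads <= MAX_THREADS);
    for (int t = 0; t < nthreads; t++) {
        job[t] = (slice_job){ recs, n * t / nthreads, n * (t + 1) / nthreads,
                              tag, out, &lock };
        if (pthread_create(&tid[t], NULL, retag_slice, &job[t]) != 0)
            break;
        started++;
    }
    for (int t = 0; t < started; t++)
        pthread_join(tid[t], NULL);
    pthread_mutex_destroy(&lock);
    return started == nthreads ? 0 : -1;
}
```

Taking the lock once per record is simple but probably too fine-grained for real data; a worker would more likely fill a large local buffer with many records and take the lock once per buffer.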
Thanks.
Sorry if this is off topic, but what is the purpose of using bgzipped fastq? Do you use the random-access aspect of bgzip?
Speed. It's much faster than regular gzip. And if compiled with libdeflate, even better. I'm looking for anything to help me code with the htslib library. Examples of apps leveraging it, etc.
The code itself is extremely dense. I can tell there's sort of a hierarchy of high-level functions and structs declared in hts.h and that's what I wish to stick to as much as possible.
The htslib docs (htslib.org) really seem to just show you how to use samtools or bcftools.
Is that really true? In what circumstance is it faster? I'd be curious to see a benchmark. Sorry I'm not being helpful.
In the circumstance that you have a multithreaded CPU and the application's speed is bound by the compression/decompression speed. Which will be often, certainly in situations where only a mild transformation is being applied. I don't know what's so controversial about this; bgzip has been around for a decade. Here, I've got a 3.4 GB test.fastq: