Benchmark on parsing human chromosome fasta (C source code at the end):
Data Library RealTime UserTime SysTime
-----------------------------------------------------------
humanChr-plain seq_file 127 88 38
humanChr-gzip seq_file 246 237 10
humanChr-plain kseq 11 8 3
humanChr-gzip kseq 31 29 2
humanChr-plain wc 21 9 12
humanChr-gzip zcat|wc 38 39 18
Seq_file seems to be slower than my single-header kseq library. I have not tried on short-read FASTQ. The results may be different. Also, seq_file is linked to zlib-1.2.3. Probably using zlib-1.2.5 wil lead to better performance because in the most widely used zlib-1.2.3, I/O is not buffered.
Source code based on seq_file. Somehow the program is buggy: it always misses the first sequence in the file.
#include <stdio.h>
#include <stdlib.h>
#include "seq_file.h"
int main(int argc, char *argv[])
{
SeqFile *f;
StrBuf *buf;
if (argc == 1) return 1;
f = seq_file_open_filetype(argv[1], SEQ_FASTA);
buf = strbuf_new();
while (seq_next_read(f)) {
seq_read_all_bases(f, buf);
printf("%s\t%ld\n", seq_get_read_name(f), strbuf_len(buf));
strbuf_reset(buf);
}
seq_file_close(f);
strbuf_free(buf);
return 1;
}
Based on kseq:
#include <zlib.h>
#include <stdio.h>
#include "kseq.h"
KSEQ_INIT(gzFile, gzread)
int main(int argc, char *argv[])
{
gzFile fp = gzopen(argv[1], "r");
kseq_t *ks = kseq_init(fp);
if (argc == 1) return 1;
while (kseq_read(ks) >= 0)
printf("%s\t%ld\n", ks->name.s, ks->seq.l);
kseq_destroy(ks);
gzclose(fp);
return 0;
}
Do you have any sense of how it is better than the htslib upon which it is apprently built?