I have a 450 GB FASTQ file (162,266,303 sequences) that I would like to parse. I am using Heng Li's C parser, readfq (https://github.com/lh3/readfq). Is it the fastest option available? With the following program, it takes more than an hour just to count the sequences.
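    #include <zlib.h>
    #include <stdio.h>
    #include "kseq.h"

    /* Instantiate the kseq reader for gzFile streams read with gzread(). */
    KSEQ_INIT(gzFile, gzread)

    int main(int argc, char **argv)
    {
        gzFile fp;
        kseq_t *seq;
        long n = 0; /* number of records read so far */

        if (argc < 2) {
            fprintf(stderr, "Usage: %s <in.fastq>\n", argv[0]);
            return 1;
        }
        fp = gzopen(argv[1], "r"); /* gzopen also reads uncompressed files */
        if (fp == NULL) {
            fprintf(stderr, "Cannot open %s\n", argv[1]);
            return 1;
        }
        seq = kseq_init(fp);
        while (kseq_read(seq) >= 0) { /* negative return: end of file or error */
            ++n;
            /* DO SOME PROCESSING */
        }
        printf("Total n. of sequences: %ld\n", n);
        kseq_destroy(seq);
        gzclose(fp);
        return 0;
    }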
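
For scale: 450 GB in one hour is about 125 MB/s, so the run may be limited by how fast the file can be read rather than by the parser. A rough way to check is to time a bare newline counter on the same file and compare it with the kseq loop. The sketch below is only an illustration: `count_newlines` is a hypothetical helper (not part of readfq or kseq.h), and it assumes the input is uncompressed (or decompressed upstream and piped in) and that every record is exactly four lines, so dividing the result by four gives the record count.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of readfq/kseq.h: counts newline
   characters in a stream using large block reads. Assumes plain
   (uncompressed) text on the stream. */
static unsigned long long count_newlines(FILE *in)
{
    static char buf[1 << 20]; /* 1 MiB read buffer */
    unsigned long long lines = 0;
    size_t got;

    while ((got = fread(buf, 1, sizeof buf, in)) > 0) {
        const char *p = buf;
        const char *end = buf + got;
        /* Jump from newline to newline within the block just read. */
        while ((p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
            ++lines;
            ++p;
        }
    }
    return lines;
}
```

Calling `count_newlines(stdin)` from a small `main` and printing the result divided by four gives a baseline. If that baseline takes about as long as the kseq loop, a faster parser is unlikely to help much and the storage is the thing to look at.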
Thanks, Alex, for the suggestions.