Question

Faster reading sequences from a fastq file

0

Entering edit mode

6.9 years ago

brs1111 ▴ 10

I have a fastq file of size 450gb (162,266,303 sequences) and I would like to parse the sequences. I am using Heng Li's C program (https://github.com/lh3/readfq). I am wondering if it is the fastest one as it takes more than an hour to just count the number of sequences using the following program.

#include <zlib.h>
#include <stdio.h>
#include "kseq.h"
KSEQ_INIT(gzFile, gzread)

int main(int argc, char **argv)
{
        gzFile fp;
        kseq_t *seq;
        int n = 0, slen = 0, qlen = 0;
        fp = gzopen(argv[1], "r");
        seq = kseq_init(fp);
        while (kseq_read(seq) >= 0){
                ++n;
               /* DO SOME PROCESSING */
        }
        printf("Total n. of sequences: %d\n", n);
        kseq_destroy(seq);
        gzclose(fp);
        return 0;
}

fastq c • 5.6k views

ADD COMMENT • link updated 6.9 years ago by kloetzl ★ 1.1k • written 6.9 years ago by brs1111 ▴ 10

score 4 · Answer 1 · 2018-01-17

That's a pretty big file. Perhaps profile the code and see where things are slowest. You might improve the speed of zlib decompression with https://github.com/ebiggers/libdeflate or similar. Or look at recompressing with bzip2 or other blocked compression schemes and using a parallel decompression routine like pbzip2. Likewise, you might look at how kseq_read() is reading in and parsing buffers of data, to see if any buffer sizes or other read tweaks can be made to improve I/O. If you have an SSD, you might use that for storage to enable parallel reads, using an asynchronous library to read and parse multiple streams.

score 2 · Answer 2 · 2018-01-17

2

Entering edit mode

6.9 years ago

GenoMax 148k

You could look to see if you can use pigz (https://zlib.net/pigz/ ) instead of gzip.

BTW: 450gb sounds rather large for 162 Million sequences. Is that the uncompressed size?

ADD COMMENT • link 6.9 years ago by GenoMax 148k

0

Entering edit mode

Yes, it is not compressed.

ADD REPLY • link 6.9 years ago by brs1111 ▴ 10

0

Entering edit mode

If the file is uncompressed then perhaps only @kloetzl's answer applies in your case. Is this non-illumina data by any chance?

ADD REPLY • link 6.9 years ago by GenoMax 148k

0

Entering edit mode

Just curious, but why are you using gzopen() if the input is not compressed?

ADD REPLY • link 6.9 years ago by Alex Reynolds 36k

score 1 · Answer 3 · 2018-01-17

1

Entering edit mode

6.9 years ago

Dan D 7.4k

fastp might be worth a look.

ADD COMMENT • link 6.9 years ago by Dan D 7.4k

score 1 · Answer 4 · 2018-01-17

1

Entering edit mode

6.9 years ago

chen ★ 2.5k

You can try the code of fastp, which provides ultra-fast FASTQ file preprocessing. You may look into the src/fastqreader.cpp which provides a almost-standalone C++ class to read FASTQ file, and can be reused by anyone else.

ADD COMMENT • link 6.9 years ago by chen ★ 2.5k

0

Entering edit mode

Thanks Chen !! Let me try fastp

ADD REPLY • link 6.9 years ago by brs1111 ▴ 10

0

Entering edit mode

Hi Chen, I am not able to figure out how to use it. Could you please provide a simple main() ?

ADD REPLY • link 6.9 years ago by brs1111 ▴ 10

score 1 · Answer 5 · 2018-01-18

1

Entering edit mode

6.9 years ago

kloetzl ★ 1.1k

If your data is not gzipped, you can avoid going through zlib by just KSEQ_INIT(int, read). That should reduce latency. Make sure to have a buffer in kseq of atleast 16384 bytes. Try increasing it, to avoid syscall overheads (which became more pronounced by Meltdown mitigations). The maximum throughput I could measure for kseq is 847MiB/s. You can get a bit faster ~1GiB/s if you start using intrinsics, for special instructions (pcmpistri). Note sure if using a mmapped approach would give you any advantage here.

ADD COMMENT • link 6.9 years ago by kloetzl ★ 1.1k

0

Entering edit mode

Thanks for the suggestion. I modified the code and it works good now. It takes little more than an hour just to read and count the number sequences. But when I calculate hash for all k-mers of all sequences , it takes more than 7 hours. I understand that there are additional operations for calculating the hash, but the programs like KmerStream(https://github.com/pmelsted/KmerStream) and ntCard(https://github.com/bcgsc/ntCard) does that in less than 2 hours. Here is the sample code I am using.

#include <zlib.h>
#include <stdio.h>
#include "kseq.h"
KSEQ_INIT(int, read)

int main(int argc, char **argv)
{
        FILE* fp;
        kseq_t *seq;
        int n = 0, slen = 0, qlen = 0;
        fp = fopen(argv[1], "r");
        int k = atoi(argv[2]); // K-mer length
        seq = kseq_init(fileno(fp));
        while (kseq_read(seq) >= 0){
           for(int i=0; i<strlen(seq->seq.s)- k+1; i++)
             uint64_t hash = HASH(seq->seq.s+i, k, seed);   // HASH is either murmerHash (used in KmerStream) or ntHash (used in ntCard)
        }
        kseq_destroy(seq);
        fclose(fp);
        return 0;
}

ADD REPLY • link 6.9 years ago by brs1111 ▴ 10

0

Entering edit mode

Those tools are using a rolling hash. I.e. they use the previous hash value to quickly compute the next one.

ADD REPLY • link 6.9 years ago by kloetzl ★ 1.1k

0

Entering edit mode

Yes, I saw that on ntCard paper. In fact, I used the sample code given in ntHash

ADD REPLY • link 6.9 years ago by brs1111 ▴ 10

0

Entering edit mode

Maybe you forgot some compiler flags? -O3 -march=native. Sorry, can't help much more than that, remotely.

ADD REPLY • link 6.9 years ago by kloetzl ★ 1.1k

0

Entering edit mode

Thanks for your help !! Let me try these compiler flags. I will try to figure it out.

ADD REPLY • link 6.9 years ago by brs1111 ▴ 10