Question

Why is bcl2fastq2 taking so long to calculate stats?

3.2 years ago

divibisan • 0

Our lab has been using bcl2fastq v2.20.0.422 to demultiplex RNA-seq data sequenced on an Illumina NovaSeq on a beefy EC2 instance, and we've run into a strange problem: while demultiplexing is very fast, generating the stats files is unbearably slow.

In a recent example, we demultiplexed one lane of a NovaSeq run (the fastq.gz files totaled 464 GB) on an m5.24xlarge instance (96 vCPUs), with the data being read from and written to the same EFS mount. The demultiplexing took only ~70 minutes, while generating the Stats files took over 4 hours (14,557 seconds to be exact)!

Running with log level INFO, the demultiplexing progressed quickly as expected, all FASTQ files were generated, and bcl2fastq output the message:

INFO: Created InterOp file '"./InterOp/IndexMetricsOut.bin"

It then sat for hours at <1% CPU, writing to IndexMetricsOut.bin very slowly. Tracking the size of IndexMetricsOut.bin showed that it was being written continuously at about 4 KB per second. It did eventually finish (final size 49M), and bcl2fastq reported that it completed successfully, but it seems crazy to me that this stage could take so much longer than actually demultiplexing almost half a TB of data.
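As a back-of-the-envelope check that the slow write alone accounts for the delay, the figures above can be plugged into a short calculation. This is only a sketch: it reads "49M" as 49 MiB and takes the roughly observed 4 KB per second at face value.

```python
# Figures quoted in the post; the write rate was an approximate observation.
final_size_bytes = 49 * 1024 * 1024   # "final size 49M", read here as 49 MiB (assumption)
write_rate_bytes_per_s = 4 * 1024     # "about 4 KB per second"
stats_stage_seconds = 14_557          # reported duration of the stats stage

# Time needed to write the whole file at the observed rate.
expected_seconds = final_size_bytes / write_rate_bytes_per_s

print(f"expected write time: {expected_seconds:.0f} s ({expected_seconds / 3600:.1f} h)")
print(f"reported stage time: {stats_stage_seconds} s ({stats_stage_seconds / 3600:.1f} h)")
```

The estimate comes out at about 3.5 hours, close to the roughly 4 hours reported, which suggests that nearly all of the stats stage was spent on this one slow write.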

Does anyone have any ideas why the stats files are taking so long to generate, and what I could do about it? Additionally, since the FASTQ files seem to be complete, is it safe to just move on to alignment without waiting for bcl2fastq to finish?

bcl2fastq illumina RNA-seq AWS

score 0 · Answer 1 · 2021-09-29

3.2 years ago

swbarnes2 14k

We don't have a NovaSeq, but we typically run hundreds of samples at a time on the NextSeq, and it takes a long time to generate those reports. I generally go ahead with alignment without waiting for them.
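If you want some extra reassurance before starting alignment, one option is to check that every fastq.gz file decompresses cleanly to the end. The sketch below is not something bcl2fastq provides; the function names are made up for illustration, and it assumes the outputs match `*.fastq.gz` under a single output directory. A passing check only shows that each file is a complete gzip stream, not that bcl2fastq has written every file it intends to.

```python
import gzip
import zlib
from pathlib import Path


def gzip_is_complete(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream decompresses without error."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read through to the end; a truncated or corrupt file raises first.
            while handle.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        return False
    return True


def find_incomplete_fastqs(output_dir):
    """List *.fastq.gz files under output_dir that fail the check (hypothetical helper)."""
    return [
        path
        for path in sorted(Path(output_dir).rglob("*.fastq.gz"))
        if not gzip_is_complete(path)
    ]
```

Calling `find_incomplete_fastqs` on the bcl2fastq output directory returns an empty list when every file reads cleanly. On hundreds of GB this means decompressing everything once, so it is not instant.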

score 0 · Answer 2 · 2021-09-29

You should consider switching to bcl-convert. It is the long-term replacement for bcl2fastq and is required for the NextSeq 1K/2K (and possibly future sequencers). bcl-convert is backwards compatible with all sequencers and is significantly faster than bcl2fastq. Its demultiplexing reports also use a different format from bcl2fastq's: they are now plain text files, which are easier to parse.
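To give a rough idea of what "easier to parse" can look like, here is a minimal sketch that totals reads per sample from a comma-separated report using only the Python standard library. The column names `SampleID` and `# Reads` are assumptions made for illustration, as is the function name; check the header row of your own report and adjust them.

```python
import csv
from collections import defaultdict


def reads_per_sample(report_path, sample_col="SampleID", reads_col="# Reads"):
    """Sum a read-count column per sample from a CSV report.

    The default column names are assumptions; pass the names that
    actually appear in your report's header row.
    """
    totals = defaultdict(int)
    with open(report_path, newline="") as handle:
        for row in csv.DictReader(handle):
            # A sample can appear on several rows (e.g. one per lane).
            totals[row[sample_col]] += int(row[reads_col])
    return dict(totals)
```

Because the function only relies on a header row and two named columns, it works for any plain-text CSV laid out that way.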