Our lab has been using bcl2fastq v2.20.0.422 on a beefy EC2 instance to demultiplex RNA-seq data sequenced on an Illumina NovaSeq, and we've run into a strange problem: while demultiplexing is very fast, generating the stats files is unbearably slow.
In a recent example, we demultiplexed one lane of a NovaSeq run (the fastq.gz files totaled 464 GB) on an m5.24xlarge instance (96 vCPUs), with the data being read from and written to the same EFS mount. The demultiplexing took only ~70 minutes, while generating the Stats files took over 4 hours (14,557 seconds to be exact)!
Running with log level INFO, the demultiplexing progressed quickly as expected, all fastq files were generated, and bcl2fastq output the message:
INFO: Created InterOp file '"./InterOp/IndexMetricsOut.bin"
and then sat for hours using <1% CPU, writing to IndexMetricsOut.bin very slowly. Tracking the size of the IndexMetricsOut.bin file showed that it was being written to continuously at a rate of about 4 KB per second. It did eventually finish (final size 49 MB) and bcl2fastq reported that it completed successfully, but it seems crazy to me that this stage could take so much longer than actually demultiplexing almost half a TB of data.
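For anyone who wants to reproduce the measurement, here is a minimal sketch of how the write rate can be tracked by polling the file size. The function names (`measure_write_rate`, `seconds_to_write`) are made up for illustration and are not part of bcl2fastq, and I am assuming the sizes are in binary units (KiB, MiB). Under that assumption, a 49 MiB file written at 4 KiB per second needs roughly 12,500 seconds, which is in the same range as the 14,557 seconds the stats stage actually took.

```python
import os
import sys
import time


def measure_write_rate(path, interval=10.0, samples=6, sleep=time.sleep):
    """Poll the size of `path` and return the average growth in bytes/second.

    `sleep` is injectable so the function can be exercised without waiting.
    """
    first = os.path.getsize(path)
    last = first
    for _ in range(samples):
        sleep(interval)
        last = os.path.getsize(path)
    return (last - first) / (interval * samples)


def seconds_to_write(total_bytes, rate_bytes_per_s):
    """Time needed to write `total_bytes` at a constant rate."""
    return total_bytes / rate_bytes_per_s


if __name__ == "__main__":
    # Path is taken from the command line, e.g. ./InterOp/IndexMetricsOut.bin
    target = sys.argv[1]
    rate = measure_write_rate(target)
    print(f"{target}: {rate / 1024:.1f} KiB/s")
    # Assumed final size of 49 MiB, taken from the run described above.
    if rate > 0:
        print(f"49 MiB at this rate: {seconds_to_write(49 * 1024**2, rate):.0f} s")
```

Run it with the path to the growing file as the only argument; it samples for one minute with the defaults.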
Does anyone have any ideas why the stats files are taking so long to generate, and what I could do about it? Additionally, since the fastq files seem to be completed, is it safe to just move on to alignment without waiting for bcl2fastq to finish?