I'm a big fan of the BBTools suite, but I want to check that I'm not doing anything silly here. I have several terabytes of decontaminated, QC'd, error-corrected reads that I'd like to normalize prior to cross-assembly of the agglomerated reads. I ran the loglog.sh script from the BBTools suite, and the estimated cardinality implied ~2.1TB of RAM to keep the hash table less than 50% full, per Brian's recommendation. Our HPC has only 2TB of RAM on the high-capacity nodes, which I didn't think would be an issue, but BBNorm keeps hanging during the second pass (of the default 2). I assume this is due to I/O issues when I'm using the full resource allocation, but it hangs even when I limit the virtual memory to 1.8TB.
I'm open to suggestions here, but my other thought was to split the large, agglomerated read file into five or so equal parts, normalize those in parallel without resource constraints, then merge them and run an additional round of normalization to bring down any k-mers that hit my target level (100X) in each subset and would therefore be overrepresented in the merged dataset.
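Concretely, I'm picturing something along these lines (file names, memory values, and the splitting step are placeholders I haven't settled on yet):

# round 1: normalize each of the five parts to 100X, submitted as separate jobs in parallel
bbnorm.sh in=part1.fq.gz out=part1_norm.fq.gz target=100 -Xmx350g
# ...same command for parts 2-5...

# merge the normalized parts (concatenated gzip members are still valid gzip)
cat part1_norm.fq.gz part2_norm.fq.gz part3_norm.fq.gz part4_norm.fq.gz part5_norm.fq.gz > merged_norm.fq.gz

# round 2: re-normalize the merged file to knock down k-mers that hit 100X in more than one part
bbnorm.sh in=merged_norm.fq.gz out=final_norm.fq.gz target=100 -Xmx350g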
Does this seem like a reasonable workaround or am I missing something? Also open to better ideas.
That sounds logically reasonable since the dataset requirements are beyond the limits of available hardware. I wonder if there would be any benefit to using partial overlap between the five pieces when you do round 1 normalization.
Following up on this in case it's of benefit. I performed the iterative normalization approach, splitting the agglomerated file into five parts with seqkit split2; it was pretty slow but it worked.
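For reference, the split step was essentially this (file and directory names are just placeholders; -p/--by-part splits the input into N roughly equal parts):

seqkit split2 -p 5 -O split_parts agglomerated_reads.fq.gz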
FWIW, the read/write functions implemented in BBTools are at least an order of magnitude faster than seqkit (and many other fasta/q manipulation packages), and this isn't a differential I/O limitation, as I ran everything on the same host on our cluster. Brian really channeled some black magic here. For comparison, I split reads into parts with reformat.sh (albeit in a roundabout way: skipping n reads, then grabbing the first x reads from that point, repeated several times across the file), and it completed in just a few hours despite having to read through billions of reads to reach the extraction point for some splits.
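In case it's useful to anyone, the roundabout reformat.sh splitting looked roughly like this (read counts are placeholders; my understanding is that skipreads= discards that many input reads before processing and reads= then stops after that many):

# e.g. pulling out the second fifth of a ~5-billion-read file
reformat.sh in=agglomerated_reads.fq.gz out=part2.fq.gz skipreads=1000000000 reads=1000000000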
Anyway, the iterative normalization approach worked just fine, but possibly the most useful point of my rambling here is to remind folks to CHECK YOUR HASH TABLE OCCUPANCY! I wasn't able to check the error logs of my initial BBNorm jobs that were hanging, because the qdel function of the PBS scheduler (argghh) doesn't send a traditional kill signal, which would have triggered the job to write its standard error output. Out of curiosity, I reran the job and killed it with qsig -s 9 jobid, which sends a normal kill signal and lets the standard error output get written.

This let me check BBNorm's output (written to standard error), and I realized that my hash tables were only about 2% full, despite my cardinality estimate from loglog.sh suggesting 2.1TB would be needed for a <50% occupied table. The error file also confirmed my suspicion that I/O issues were the culprit for the original job hang. Digging in a little more, I realized the problem was that I had given 95% of the node's memory allocation to BBNorm (or, more precisely, the JVM), leaving nothing for Java to buffer its I/O, which actually has a surprisingly large memory footprint beyond the allocation specified by the -Xmx flag. So I gave BBNorm a smaller fraction of the total memory allocation (75%), left the rest for I/O buffering, and it worked like a charm.
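For anyone hitting the same wall, the fix boiled down to something like this (file names and numbers are placeholders for a node with ~2TB requested; the point is setting -Xmx to roughly 75% of the allocation rather than ~95%):

# leave ~25% of the node's memory outside the JVM for I/O buffering
bbnorm.sh in=reads.fq.gz out=normalized.fq.gz target=100 -Xmx1500g

# and if a job hangs under PBS, this is what got the standard error written out for me
qsig -s 9 <jobid>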
TL;DR: check the occupancy of your hash tables, and don't give BBNorm (or your JVM generally) the vast majority (say >90%) of your memory allocation for big jobs with large I/O streams.