pv trimmomatic.fastq.gz | reformat.sh int=f in=stdin.fastq.gz out=stdout.fasta | dedupe.sh in=stdin.fasta out=trimmomatic.dedupe.fasta | clumpify.sh in=stdin.fasta out=trimmomatic.dedupe.clumpify.fasta.gz
My gzipped file is 40 GB. I requested 250 GB on our server but still got the error. I then tried -Xmx250g, but it says there's not enough space for some reason.
Does anyone know how to run this so I can avoid the OutOfMemory issue?
Executing jgi.ReformatReads [int=f, in=stdin.fastq.gz, out=stdout.fasta]
Set INTERLEAVED to false
Input is being processed as unpaired
Executing jgi.Dedupe [in=stdin.fasta, out=trimmomatic.dedupe.fasta]
Version 38.73
Initial:
Memory: max=28789m, total=28789m, free=28777m, used=12m
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Thread-0"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Thread-2"
I didn't know that Clumpify can dedupe as well! I'll look through the documentation on getting Clumpify to work for large sequence sets. I've had some issues where I run out of memory, and I'm confused about how to know how much memory to request based on the original file size. Is that in the documentation?
Current versions of Clumpify write temp files, so you don't need to fit everything in memory for deduplication. However, it can't know how much memory it needs when reading from a pipe, because it has no idea how much data is coming. So you need to set a flag like "groups=50" to make it write 50 temp files. Each of those needs to fit in memory, and they should split roughly equally. So, if you have a 100 GB (uncompressed) fastq file and 10 GB of RAM, you'd probably want at least 40 groups: that allows ~2.5 GB per file if you hold 2 files in memory at any given time and reserve 50% overhead. More groups would be safer; there's no real penalty.
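The arithmetic in that rule of thumb can be sketched as a small Python helper. This is a hypothetical calculator, not part of BBTools: the function name, the default of 2 files in memory, and the 50% overhead reserve are all assumptions taken from the paragraph above.

```python
import math

def min_clumpify_groups(uncompressed_gb, ram_gb,
                        files_in_memory=2, overhead=0.5):
    """Estimate a lower bound for Clumpify's groups= flag.

    Hypothetical helper following the rule of thumb above: each temp
    file must fit in its share of RAM, with `overhead` held in reserve.
    """
    # RAM budget for one temp file, after reserving the overhead fraction
    per_file_gb = (ram_gb / files_in_memory) * (1 - overhead)
    # Number of roughly equal temp files needed to stay within that budget
    return math.ceil(uncompressed_gb / per_file_gb)

# The example from the text: 100 GB uncompressed data, 10 GB RAM
print(min_clumpify_groups(100, 10))  # → 40
```

Since more groups carry no real penalty, rounding this estimate up generously is the safe choice.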
Anyway, it's always safest and simplest to run without streaming for programs that need to store a lot of data in memory.
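A staged, non-streaming version of the original pipeline might look like the sketch below. Here each step reads a real file, so Clumpify can size its memory from the actual input; since Clumpify dedupes on its own, the separate dedupe.sh step is dropped. The groups and -Xmx values are illustrative assumptions for this 40 GB input, not prescriptions.

```shell
# Step 1: convert fastq to fasta on disk (no piping)
reformat.sh int=f in=trimmomatic.fastq.gz out=trimmomatic.fasta.gz

# Step 2: clump and dedupe in one pass; groups=50 caps per-temp-file
# memory, and -Xmx should match what the scheduler actually granted
clumpify.sh -Xmx250g in=trimmomatic.fasta.gz \
    out=trimmomatic.dedupe.clumpify.fasta.gz \
    dedupe=t groups=50
```

Note that the log above shows the JVM only got ~28 GB (max=28789m) despite the 250 GB request, so it is worth confirming that the -Xmx value and the scheduler's memory grant agree.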