How do I know how much memory to use when piping with the BBMap suite (OutOfMemoryError)?
5.0 years ago
O.rka ▴ 740
pv trimmomatic.fastq.gz | reformat.sh  int=f in=stdin.fastq.gz out=stdout.fasta | dedupe.sh in=stdin.fasta out=trimmomatic.dedupe.fasta | clumpify.sh  in=stdin.fasta out=trimmomatic.dedupe.clumpify.fasta.gz

My gzipped input file is 40G. I requested 250G of memory on our server but still hit the error. I then tried passing -Xmx250g, but it reports that there is not enough space for some reason.

Does anyone know how to do this to avoid the Out of Memory issue?

Executing jgi.ReformatReads [int=f, in=stdin.fastq.gz, out=stdout.fasta]

Set INTERLEAVED to false
Input is being processed as unpaired
Executing jgi.Dedupe [in=stdin.fasta, out=trimmomatic.dedupe.fasta]
Version 38.73

Initial:
Memory: max=28789m, total=28789m, free=28777m, used=12m


Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Thread-0"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Thread-2"
bbmap java memory reads • 2.0k views
5.0 years ago
h.mon 35k

With their default settings, both clumpify.sh and dedupe.sh hold all sequences in memory. Both commands have options for processing files that do not fit in memory; check their user guides and documentation.

https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/clumpify-guide/

https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/dedupe-guide/

Clumpify can perform deduplication, so piping from dedupe.sh to clumpify.sh is unnecessary.
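
As a rough sketch of what that single step could look like, the snippet below builds a Clumpify command that reads the file directly rather than from a pipe. The file names are taken from the question; dedupe=t is Clumpify's deduplication switch. It assumes Clumpify converts FASTQ to FASTA based on the output extension, so check the guide before relying on that. The command is only printed here, not executed.

```shell
#!/bin/sh
# Dry run: build and print the command instead of executing it.
# File names come from the original question; adjust to your data.
in_file="trimmomatic.fastq.gz"
out_file="trimmomatic.dedupe.clumpify.fasta.gz"

# dedupe=t asks Clumpify to remove duplicates while clumping.
# Assumption: the .fasta.gz extension is enough to get FASTA output.
cmd="clumpify.sh in=$in_file out=$out_file dedupe=t"

echo "$cmd"
```

Once the printed command looks right, it can be run as-is from the shell.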


I didn't know that clumpify can dedupe as well! I'll look through the documentation on getting clumpify to work for large sequence sets. I have had some issues where I run out of memory, and I'm confused about how to know exactly how much memory to request based on the original file size. Is that in the documentation?


Current versions of Clumpify will write temp files, so you don't need to have everything in memory for deduplication. However, it won't know how much memory it needs when piping because it has no idea how much data there is. So you need to set a flag such as "groups=50" to make it write 50 temp files. Each one of those needs to fit in memory, and they should split roughly equally. So, if you have a 100 GB (uncompressed) fastq file and 10 GB RAM, you'd probably want to use at least 40 groups: that gives ~2.5 GB per file, so with 2 files in memory at any given time you use ~5 GB and keep the other 50% of RAM as overhead. More groups would be safer, and there's no real penalty for using more.
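
The arithmetic above can be written down as a small helper. This is only a sketch of the rule of thumb in this reply: estimate_groups is a hypothetical function (not part of BBTools), the "2 files in memory" and "50% of RAM" defaults are the assumptions stated above, and the output name clumped.fastq.gz is made up for illustration. The Clumpify command is printed, not run.

```shell
#!/bin/sh
# Rough estimate of a groups= value for Clumpify when streaming input.
# Assumptions (the rule of thumb from this reply, not BBTools internals):
#   - about 2 temp files are held in memory at the same time
#   - only ~50% of the available RAM is budgeted for read data
# Arguments: uncompressed size in GB, RAM in GB,
#            [files in memory, default 2], [usable RAM percent, default 50]
estimate_groups() (
    uncompressed_gb=$1
    ram_gb=$2
    files_in_memory=${3:-2}
    usable_percent=${4:-50}

    # groups = uncompressed / (ram * usable% / files_in_memory), rounded up
    numerator=$(( uncompressed_gb * files_in_memory * 100 ))
    denominator=$(( ram_gb * usable_percent ))
    echo $(( (numerator + denominator - 1) / denominator ))
)

# Worked example from the text: 100 GB uncompressed FASTQ, 10 GB RAM.
groups=$(estimate_groups 100 10)
echo "groups=$groups"

# Hypothetical streamed invocation using that value (printed only).
echo "clumpify.sh in=stdin.fastq out=clumped.fastq.gz dedupe=t groups=$groups"
```

For the 100 GB / 10 GB case this prints groups=40, matching the figure above. Treat the result as a lower bound and round up generously, since the uncompressed size of a gzipped stream is usually only a guess.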

Anyway, it's always safest and simplest to run without streaming for programs that need to store a lot of data in memory.
