Hi everybody,
Like other NGS data analysts, as a new user I am also using Java applications designed for NGS data analysis. However, I have seen various JVM tuning parameters on the command line when these jars process large data sets. For example, I recently came across two example commands:
java -Dsamjdk.buffer_size=131072 -Dsamjdk.compression_level=1 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx128m -jar /path/picard.jar SamToFastq ...
java -Dsamjdk.buffer_size=131072 -Dsamjdk.use_async_io=true -Dsamjdk.compression_level=1 -XX:+UseStringCache -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx5000m -jar /path/picard.jar MergeBamAlignment ...
Parameters:
- -Dsamjdk.buffer_size
- -XX:GCTimeLimit
- -XX:GCHeapFreeLimit
- -Xmx128m
- -XX:+UseStringCache
- -Dsamjdk.use_async_io=true
I have tried to learn about them from the internet, but I am still not clear on how to set these parameters (listed above) for large data sets while piping various tools together. As I have a biological background, could anyone please explain in a bit of detail how to use them, with a one-line definition and purpose for each?
Thank you very much!
Hi Pierre,
Thank you for explaining each aspect comprehensively. :) However, I have one point of confusion: is the memory for buffers in Java allocated out of the heap's allotted memory, or is it assigned independently by the JVM?
A buffer like samjdk.buffer_size is the size of the memory block that the htsjdk library uses for storing short reads in memory. For example, when writing, the reads are first stored in a memory buffer; when this buffer is full, the reads are written to disk. The larger it is, the faster your application runs (less I/O), but the more memory (heap) you need.
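To see the mechanism (and to address the heap question above), here is a minimal sketch using the standard java.io.BufferedOutputStream rather than htsjdk itself: the buffer is an ordinary byte[] array, and like any Java array it lives on the heap, so it counts against -Xmx. Data only reaches the underlying sink when the buffer overflows or the stream is closed. The 131072 value mirrors the samjdk.buffer_size in the commands above; the class and variable names here are just for illustration.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class BufferDemo {
    public static void main(String[] args) throws IOException {
        // The sink stands in for a file on disk.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();

        // BufferedOutputStream allocates a byte[bufferSize] internally;
        // arrays are heap objects, so this memory comes out of -Xmx.
        int bufferSize = 131072;
        BufferedOutputStream out = new BufferedOutputStream(sink, bufferSize);

        out.write(new byte[100000]); // fits in the buffer: nothing hits the sink yet
        System.out.println("after 100000 bytes, sink holds: " + sink.size());

        out.write(new byte[100000]); // overflows the buffer: one flush to the sink
        System.out.println("after 200000 bytes, sink holds: " + sink.size());

        out.close();                 // close flushes the remainder
        System.out.println("after close, sink holds: " + sink.size());
    }
}
```

This is why a larger buffer reduces I/O: with a small buffer the same 200000 bytes would trigger many small writes instead of two large ones.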