Two-Stage Multithreaded Version for Sorted BAMs
While this thread already has some great answers, I wanted to suggest a parallelized version that is robust to open file limits (e.g., > 4096 files). This requires GNU parallel.
Code
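# TMPDIR must prefix the parallel call that writes the intermediates;
# a prefix on find would not carry over to later pipeline stages.
find "$BAM_DIR" -name '*.bam' |
TMPDIR=/scratch parallel -j8 -N4095 -m --files samtools merge -u - |
parallel --xargs samtools merge -@8 merged.bam {}";" rm {}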
Overview
This will take all BAM files in $BAM_DIR and run eight (-j8) separate single-threaded merge operations, with the input files (mostly) equally distributed among the different jobs. This results in temporary files which are then merged into merged.bam in a multithreaded operation. The temporary files are deleted at the end.
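To see how the first stage splits its input, here is a dry-run sketch that swaps GNU parallel for plain xargs and only echoes the commands instead of running samtools. The five file names and the batch size of two are made up for illustration. Note that xargs -n fills each batch in order, whereas parallel -m spreads the inputs roughly evenly across jobs.

```shell
# Dry run: print the first-stage merge commands without executing them.
# s1.bam ... s5.bam are hypothetical inputs; -n 2 stands in for -N4095.
batches=$(printf '%s\n' s1.bam s2.bam s3.bam s4.bam s5.bam |
    xargs -n 2 echo samtools merge -u -)
printf '%s\n' "$batches"
```

Each printed line corresponds to one temporary file that the second stage would then merge.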
Options
The number of simultaneous merge operations in the first round (-j8) need not match the number of threads used for the second round (-@8). The first round is likely to be bottlenecked by too much simultaneous writing, so you may want to keep -j lower.
Use the -N flag to change the maximum number of arguments to be given to each first round merge operation. Here 4095 is just the common open files limit minus one (for the output file).
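If your limit differs from 4096, the -N value can be derived rather than hard-coded. This is a minimal sketch: max_merge_inputs is a hypothetical helper, and the default reserve of one descriptor follows the reasoning above. Since stdin, stdout and stderr also count against the limit, a larger reserve may be safer.

```shell
# Hypothetical helper: derive the -N value from an open-files limit.
# $1 = limit (defaults to this shell's soft limit), $2 = descriptors to reserve.
max_merge_inputs() {
    mmi_limit=${1:-$(ulimit -n)}
    mmi_reserve=${2:-1}
    if [ "$mmi_limit" = "unlimited" ]; then
        # No cap reported; fall back to the value used in the answer.
        echo 4095
    else
        echo $(( mmi_limit - mmi_reserve ))
    fi
}

echo "soft open-files limit: $(ulimit -n)"
echo "suggested flag: -N$(max_merge_inputs)"
```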
The -u flag is there so the temporary files will be uncompressed, since we're deleting them in the end. That can be removed if you have concerns about storage space for the temp files.
The TMPDIR environment variable controls where the intermediates are written. On most Linux systems this defaults to /tmp, which on some distributions is a RAM-backed tmpfs. In the above example, we show how to transiently override this to instead point to /scratch, a hypothetical (fast and spacious) scratch space.
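A short sketch of how far a transient override reaches may help here. It uses only sh and printf, with /scratch as the same hypothetical path. It shows that a VAR=value prefix affects just the command it precedes, so in a pipeline the assignment needs to sit on the stage that actually writes the intermediates (the first parallel).

```shell
# A VAR=value prefix is visible only to the command it directly precedes.
inner=$(TMPDIR=/scratch sh -c 'printf "%s" "$TMPDIR"')
echo "prefixed command sees: $inner"

# The next pipeline stage does not inherit the prefix from the first stage.
downstream=$(TMPDIR=/scratch true | sh -c 'printf "%s" "${TMPDIR:-unset}"')
echo "next pipeline stage sees: $downstream"

# The calling shell itself is unchanged afterwards.
echo "current shell sees: ${TMPDIR:-unset}"
```

If every stage should see the same value, running export TMPDIR=/scratch beforehand is an alternative, at the cost of changing it for the rest of the shell session.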
That should get you around the 4092 files problem (which will be a command line length limit in your shell, if I understand things correctly)
Perhaps more like this:
But make sure to use backticks around the find statement. They got scrubbed from the comment for some reason
This works, but it still complains if there are more than 700 files.
Merge in two or three steps, keeping each step below 700 files (if that is the limit on the version of samtools you are using).
Thanks very much. I don't have them in the same directory, but this works: ~/samtools merge all.bam ~/mydirs/??/??/??/mybam.*.bam
@DocRoberson: it seems like the find is actually not needed, because the shell is already expanding the glob pattern. It works for up to 4092 files in my terminal.
Tried this strategy, but now I can't index the BAM file.