Two-Stage Multithreaded Version for Sorted BAMs

Question

Best Way To Merge A Many Thousand Small Bam Files Into One Big Bam File?

26

Entering edit mode

13.5 years ago

2184687-1231-83- ★ 5.1k

I've got a few thousand small bam files produced against the exact same reference, and I want to merge them into one single big bam file. What is the best way to do that?

Should I do this iteratively or can I pass a long list of bam files to samtools/picard/etc in one go?

Edited, since this is now partially solved. In my terminal, the methods below works for up to 4092 files. More than that raises an error:

samtools merge all.bam *.bam
samtools merge all.bam `find /basedir/ -name "*myfiles*.bam"`
samtools merge all.bam /basedir/*/???/*myfiles*.bam

bam picard samtools merge • 102k views

ADD COMMENT • link updated 12 months ago by mmfansler ▴ 460 • written 13.5 years ago by 2184687-1231-83- ★ 5.1k

11

Entering edit mode

13.4 years ago

Ryan Thompson ★ 3.6k

This should work on an arbitrary number of bam files (i.e. more than 4096).

find $BAM_DIR -name '*.bam' | {
    read firstbam
    samtools view -h "$firstbam"
    while read bam; do
        samtools view "$bam"
    done
} | samtools view -ubS - | samtools sort - merged
samtools index merged.bam
ls -l merged.bam merged.bam.bai

ADD COMMENT • link updated 5.0 years ago by Ram 44k • written 13.4 years ago by Ryan Thompson ★ 3.6k

1

Entering edit mode

This is the best technique... samtools merge is wonky/segfaults sometimes.

ADD REPLY • link 12.1 years ago by earonesty ▴ 250

1

Entering edit mode

the most amazing bash script I have ever seen. It's amazing

ADD REPLY • link 9.1 years ago by Shicheng Guo ★ 9.6k

1

Entering edit mode

Line 4: 'while read bam' can you interpret what's 'bam' they are in this context?

ADD REPLY • link 8.8 years ago by Shicheng Guo ★ 9.6k

6

Entering edit mode

13.5 years ago

Frenkiboy ▴ 260

If you do not need for the final bam file to be sorted, the fastest way would be samtools cat:

samtools cat [-h header.sam] [-o out.bam] <in1.bam> <in2.bam> [...]

ADD COMMENT • link updated 5.0 years ago by Ram 44k • written 13.5 years ago by Frenkiboy ▴ 260

1

Entering edit mode

What's the role of header.sam here? how to creat this sam file?

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.1 years ago by Shicheng Guo ★ 9.6k

0

Entering edit mode

It concatenates the SAM header to the reads. You can create the header with samtools view -h

ADD REPLY • link 5.8 years ago by Kevin Blighe 88k

3

Entering edit mode

6.0 years ago

mmfansler ▴ 460

Two-Stage Multithreaded Version for Sorted BAMs

While this thread already has some great answers, I wanted to suggest a parallelized version that is robust to open file limits (e.g., > 4096 files). This requires GNU parallel.

Code

TMPDIR=/scratch find $BAM_DIR -name '*.bam' |
  parallel -j8 -N4095 -m --files samtools merge -u - |
  parallel --xargs samtools merge -@8 merged.bam {}";" rm {}

Overview

This will take all BAM files in $BAM_DIR and run eight (-j8) separate single-threaded merge operations, with the input files (mostly) equally distributed among the different jobs. This results in temporary files which are then merged into merged.bam in a multithreaded operation. The temporary files are deleted at the end.

Options

One need not keep the number of simultaneous merge operations in the first round of merging (-j8) in correspondence with the number of threads used for the second round (-@8). It's likely the first round will be bottlenecked by too much simultaneous writing, so you may want to keep that lower.

Use the -N flag to change the maximum number of arguments to be given to each first round merge operation. Here 4095 is just the common open files limit minus one (for the output file).

The -u flag is there so the temporary files will be uncompressed, since we're deleting them in the end. That can be removed if you have concerns about storage space for the temp files.

The TMPDIR environment variable controls where the intermediates are written. On most Linux systems, this is set to /tmp and usually corresponds to RAM. In the above example, we show how to transiently override this to instead point to /scratch - a hypothetical (fast and spacious) scratch space.

ADD COMMENT • link 12 months ago by mmfansler ▴ 460

2

Entering edit mode

Really elegant :) Still, as the temporary files that are created (*.par) will be stored in memory (/tmp) running this on many files might create memory issues, especially as the intermediates are uncompressed. I did a quick test with 5000 files, 64000 reads (50bp) each and this gave about 3Gb per intermediate file after the first merge, so this will for sure overload the memory once the files get larger. It might be better to store the intermediate files after the first merge to disk to reduce the number of files and then run the second merge separately.

ADD REPLY • link 6.0 years ago by ATpoint 86k

1

Entering edit mode

Yes, good point! That is effectively what I do in practice. GNU parallel respects the TMPDIR environment variable, so if you have that specified to a local scratch disk, that's where the intermediates will go.

ADD REPLY • link 6.0 years ago by mmfansler ▴ 460

1

Entering edit mode

8.8 years ago

Shicheng Guo ★ 9.6k

Share a perl script to merge separate Bams by chrosomes (chr1-chr2,chrX,chrY,chrM)

#!/usr/bin/perl
# merge bam file by different chrosome. 
# Bam File: /home/sguo/bam 
# OUT Dir:  /home/sguo/mergeBam
use strict;
use Cwd;
my $bamdir="/home/sguo/bam";
chdir $bamdir;
my @file=glob("*bam");
my $outdir="/home/sguo/mergeBam";
my %sam;
foreach my $file(@file){
        my ($sam,undef)=split /\./,$file;
        $sam{$sam}=$sam;
}
foreach my $sam(sort keys %sam){
    open OUT, ">$outdir/$sam.bam.merge.sh";
    print OUT "cd $bamdir\n";
    my @file=glob("$sam*bam");
    my $file=join(" ",@file);
    print "$sam.bam.merge.sh\n";
    print OUT "samtools view -H $file[0]>$outdir/$sam.header\n";
    print OUT "samtools cat -h $outdir/$sam.header -o $outdir/$sam.bam $file\n";
    print OUT "samtools index $outdir/$sam.bam\n";
    close OUT ;         
    system("sh $outdir/$sam.bam.merge.sh &");   
}

ADD COMMENT • link 8.8 years ago by Shicheng Guo ★ 9.6k

3

Entering edit mode

a perl script generating a bash script calling a C program ?

ADD REPLY • link 8.8 years ago by Pierre Lindenbaum 164k

1

Entering edit mode

I still pretty much bank on simple looping in bash or shell rather than create an entire perl script for doing such purpose, if a bash script can directly read all my file at one go and merge them, although yes the computational expense is something am not considering here as per the memory and the time usage, but I believe it should be less.

ADD REPLY • link 8.8 years ago by ivivek_ngs ★ 5.2k

1

Entering edit mode

I'd like Ryan Thompson's bash script, but, It reports some warnings for my case. I don't know why? Anybody meet any warning?

ADD REPLY • link 8.8 years ago by Shicheng Guo ★ 9.6k

1

Entering edit mode

It would help to know what the warnings were ...

ADD REPLY • link 8.8 years ago by george.ry ★ 1.2k

0

Entering edit mode

enter image description here

ADD REPLY • link 8.8 years ago by Shicheng Guo ★ 9.6k

1

Entering edit mode

The post that you copied is >4 years old, so it's using the syntax from samtools 0.1. The latest versions of samtools use a different syntax for samtools sort ... which is exactly what the error says.

Change

samtools sort - merged

to

samtools sort -o merged.bam -

You will probably also want to include some multithreading (-@), so read that usage information carefully.

ADD REPLY • link 8.8 years ago by george.ry ★ 1.2k

0

Entering edit mode

I notice it once I post it. shame on myself...

ADD REPLY • link 8.8 years ago by Shicheng Guo ★ 9.6k

0

Entering edit mode

I have sorted 1679 sorted bam files like sorted.bam.0000.bam to sorted.bam.1679.bam. how do i merge all sorted bam files into single sorted bam file. Can you please give me the script by using my files examples. Thank you

ADD REPLY • link 8.5 years ago by amazingworldpr • 0

0

Entering edit mode

If those are pieces from the same bam file that are created during sorting then they should be automatically deleted once the sort process is complete. If they are still around then it is likely that your sorting did not complete properly. You can delete the pieces and then redo the sorting.

ADD REPLY • link 8.5 years ago by GenoMax 148k

Ram · Accepted Answer · 2011-07-05

22

Entering edit mode

13.5 years ago

Stefano Berri 4.4k

Both are possible, but the second is probably best and easier

Put all your bam files in a folder (or create a folder with symbolic links to all sam files you want merge) and then

samtools merge finalBamFile.bam *.bam

See here for options about samtools (header and so on)

Editing to address the issue with many file.

You can take a three steps approach.

First you print the header

samtools view -H bar/foo/anyFile.bam > allSeq.sam

then you append sequences from all files using find

find ./ -name \*.bam -exec samtools view {} \; >> allSeq.sam

and finally convert to bam

samtools view -bh allSeq.sam > allSeq.bam

I can't quite test it at the moment, but should work

ADD COMMENT • link updated 3.4 years ago by Ram 44k • written 13.5 years ago by Stefano Berri 4.4k

4

Entering edit mode

man xargs

That should get you around the 4092 files problem (which will be a command line length limit in your shell, if I understand things correctly)

ADD REPLY • link 13.5 years ago by Alaw ▴ 90

2

Entering edit mode

Perhaps more like this:

samtools merge all.bam `find /mydirs/ -name *.bam`

ADD REPLY • link updated 3.4 years ago by Ram 44k • written 13.5 years ago by Docroberson ▴ 310

2

Entering edit mode

But make sure to use backticks around the find statement. They got scrubbed from the comment for some reason

ADD REPLY • link 13.5 years ago by Docroberson ▴ 310

0

Entering edit mode

Thanks very much. I don't have them on the same dir, but this works: ~/samtools merge all.bam ~/mydirs/??/??/??/mybam.*.bam

ADD REPLY • link 13.5 years ago by 2184687-1231-83- ★ 5.1k

0

Entering edit mode

@DocRoberson: it seems like the find is actually not needed, because the shell is already expanding the regexp, at least if it's a few thousand files.

ADD REPLY • link 13.5 years ago by 2184687-1231-83- ★ 5.1k

0

Entering edit mode

@DocRoberson: it seems like the find is actually not needed, because the shell is already expanding the regexp. It works for up to 4092 files in my terminal.

ADD REPLY • link 13.5 years ago by 2184687-1231-83- ★ 5.1k

Ram · Accepted Answer · 2013-04-16

18

Entering edit mode

11.7 years ago

Erik Garrison ★ 2.4k

Bamtools handles this very cleanly. Note that this properly constructs the header, which might matter if you have different read groups in each file.

bamtools merge -list files.bamlist -out merged.bam

You can also supply a region via the -region flag, using the format 12:2232..3328.

ADD COMMENT • link 11.7 years ago by Erik Garrison ★ 2.4k

1

Entering edit mode

That one worked perfectly. It just included all @RG lines with their respective IDs. No downstream errors (at least for now)

ADD REPLY • link 11.3 years ago by mazzottidr ▴ 30

1

Entering edit mode

Which one? That one worked perfectly?

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 8.9 years ago by Shicheng Guo ★ 9.6k

0

Entering edit mode

Hi!

I have 950 bam files to merge, so your solution interested me very much! But my bamtools merge program does not have a -list option. Which version of bamtools do you have?

Thank you
Maria

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 10.6 years ago by maria.bernard1301 • 0