Operations on whole-genome scale BED and BAM files can often be trivially parallelized by chromosome.
BEDOPS tools include options for restricting work to a single chromosome's records in a BED file, which can save a lot of time when used in conjunction with a compute cluster.
Here's one way to build up a pipeline for doing things one chromosome at a time with BEDOPS.
First, we convert the BAM file to BED. This conversion script is part of BEDOPS and uses a SLURM-based cluster; BEDOPS also includes equivalent conversion scripts for SGE and GNU Parallel:
$ bam2bed_slurm reads.bam reads.bed
Parallelization reduces conversion time to roughly the time needed for the largest chromosome (usually chr1).
Once parallelized conversion finishes, the following skeleton can be parallelized easily for each chromosome ${chr}:
$ chr="chr1"
$ bedmap --chrom ${chr} --echo --echo-map --skip-unmapped regions.bed reads.bed > answer.${chr}.bed
Using --chrom ${chr} with bedmap restricts the set operation work to that chromosome, instead of reading through the entire file.
This is going to be great for what you're doing, because sequencing reads do not span chromosomes, and so we can map with bedmap on individual chromosomes.
Via bash, this for loop would process the work serially, one chromosome at a time:
$ for chr in `bedextract --list-chr reads.bed`; do echo ${chr}; bedmap --chrom ${chr} --echo --echo-map --skip-unmapped regions.bed reads.bed > answer.${chr}.bed; done
The bedextract --list-chr tool uses a binary search to delimit byte bounds between chromosomes in the input BED file, so it takes only a fraction of a second to build the list of chromosomes. We can use that list in a for loop to submit per-chromosome jobs to a compute cluster.
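The chromosome list only needs to be computed once; in bash it can be captured into an array and reused for each loop (a small sketch, using the same file names as above):

```shell
#!/usr/bin/env bash
# Capture the chromosome list once; bedextract prints one name per line.
mapfile -t chromosomes < <(bedextract --list-chr reads.bed)
echo "found ${#chromosomes[@]} chromosomes: ${chromosomes[*]}"

# Reuse the array for per-chromosome work.
for chr in "${chromosomes[@]}"; do
    bedmap --chrom "${chr}" --echo --echo-map --skip-unmapped regions.bed reads.bed > "answer.${chr}.bed"
done
```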
For instance, via a SLURM job scheduler, we can fire off one job per chromosome:
$ for chr in `bedextract --list-chr reads.bed`; do echo ${chr}; sbatch --parsable --partition="some_queue" --mem-per-cpu=1g --job-name="overlaps_${chr}" --wrap="module add bedops && bedmap --chrom ${chr} --echo --echo-map --skip-unmapped regions.bed reads.bed > answer.${chr}.bed"; done
The total runtime is reduced considerably, to roughly that needed by the largest chromosome.
To generate a collated result, we can use sbatch --parsable to get back the job IDs, then set up a final collate job with --dependency=afterok:<jobs> that runs a bedops --everything union step over the per-chromosome outputs.
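Putting that together, the submission-plus-collation step might look like the following (a sketch: the queue name, memory request, and job names are placeholders carried over from above; the dependency-string construction is the only new piece):

```shell
#!/usr/bin/env bash
# Submit one bedmap job per chromosome, collecting job IDs via --parsable.
jobids=()
for chr in $(bedextract --list-chr reads.bed); do
    jobid=$(sbatch --parsable --partition="some_queue" --mem-per-cpu=1g \
        --job-name="overlaps_${chr}" \
        --wrap="module add bedops && bedmap --chrom ${chr} --echo --echo-map --skip-unmapped regions.bed reads.bed > answer.${chr}.bed")
    jobids+=("${jobid}")
done

# Build the dependency string, e.g. afterok:1234:1235:1236
dependency="afterok:$(IFS=:; echo "${jobids[*]}")"

# The union job runs only after all per-chromosome jobs finish successfully.
sbatch --partition="some_queue" --dependency="${dependency}" \
    --job-name="collate_overlaps" \
    --wrap="module add bedops && bedops --everything answer.*.bed > answer.bed"
```

Because each per-chromosome output is itself sorted, bedops --everything over all of them yields one sorted, collated result.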