Question

What Is The Best Way To Run Bedtools In Parallel With Blocking

3

11.9 years ago

Ying W ★ 4.3k

Say I am working on a server with a shared file system and four quad-core nodes (16 cores total; I/O is not an issue). I want to run coverageBed across 20 files. Currently I have a shell script that does this sequentially. It is possible to just background the commands so they run in parallel, but I am not sure how to block in Bash (the next step requires counting between the files). Assuming I/O is not a bottleneck, what are ways of leveraging multiple nodes/cores when running bedtools (or any other sequential commands, for that matter)?

From my rudimentary understanding of parallel programming, the concept I am trying to get at is how to 'block' so that the next command after coverageBed is not executed until all coverageBed runs are done.

I was thinking of wrapping the shell commands in a Python script with a queue of coverageBed commands and a function that feeds commands four at a time (since the nodes are quad-core) and returns only when the queue is empty. Is there a better way of doing this?

bedtools parallel • 5.4k views

1

Guess you need the "wait" bash command?

11.9 years ago by lh3 33k
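A minimal sketch of the `wait` approach, for illustration only. The temporary directory, the toy .bed files, and the `run_coverage` function are hypothetical stand-ins; the real coverageBed command would go inside the function.

```shell
# Sketch: start one background job per input file, then block with `wait`.
# Everything created here (workdir, sample files) is a made-up placeholder.
workdir=$(mktemp -d)
for i in 1 2 3 4; do
    printf 'chr1\t0\t%s\n' "$((i * 100))" > "$workdir/sample$i.bed"
done

run_coverage() {
    # Hypothetical stand-in: replace with the actual coverageBed invocation.
    wc -l < "$1" > "$1.coverage"
}

for bed in "$workdir"/*.bed; do
    run_coverage "$bed" &    # `&` sends each job to the background
done

wait    # blocks until every background job started above has exited

# The next step runs only after all coverage files have been written.
n_done=0
for cov in "$workdir"/*.coverage; do
    if [ -e "$cov" ]; then n_done=$((n_done + 1)); fi
done
echo "finished $n_done coverage jobs"
```

Note that this starts all jobs at once. With 20 files on a quad-core node, some way of capping the number of concurrent jobs is probably still wanted.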

1

thanks, that is exactly what I needed

11.9 years ago by Ying W ★ 4.3k

score 8 · Answer 1 · 2013-01-16

11.9 years ago

Fred ▴ 790

You can use the tool parallel as follows:

ls *.bed | parallel -j 4 'bedtools ... {} > {}.coverage'

The syntax is close to that of 'find'
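Where GNU parallel is not installed, a similar effect can be sketched with `xargs -P`. This is a different tool from the one used above, and `-P` is an extension found in GNU and BSD xargs rather than part of POSIX. The temporary directory, the toy .bed files, and the `wc -l` command are placeholders for the real inputs and the real bedtools call.

```shell
# Sketch: run at most four jobs at a time with xargs -P.
# The workdir and sample files are made-up placeholders.
workdir=$(mktemp -d)
for i in 1 2 3 4 5 6; do
    printf 'chr1\t0\t%s\n' "$((i * 100))" > "$workdir/sample$i.bed"
done

# -P 4  : keep up to four jobs running concurrently
# -I {} : substitute each input line (a file name) for {}
# xargs returns only after every job has finished, so it also blocks.
printf '%s\n' "$workdir"/*.bed |
    xargs -P 4 -I {} sh -c 'wc -l < "$1" > "$1.coverage"' _ {}

# Runs only after all six jobs are done.
n_done=0
for cov in "$workdir"/*.coverage; do
    if [ -e "$cov" ]; then n_done=$((n_done + 1)); fi
done
echo "xargs finished $n_done jobs"
```

Because the pipeline does not return until all jobs are complete, whatever command follows it in the script is effectively blocked, which is the behaviour asked for in the question.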

score 3 · Answer 2 · 2013-01-16

The tool you need is the venerable 'make' with the option '-j':

   -j [jobs], --jobs[=jobs]

        Specifies the number of jobs (commands) to run simultaneously.

Here is a Makefile as an example. The coverage files can be generated in parallel, but merge.txt will have to wait for them:

#pattern rule: how to make a *.coverage from a *.bed
#(recipe lines must start with a tab character, not spaces)
%.coverage:%.bed
	echo "do something with $< and generate $@" > $@

#your list of *.bed
bed.files=file1.bed file2.bed file3.bed file4.bed

#special targets that don't really exist
.PHONY: all clean

#top target 'all' needs merge.txt
all: merge.txt

# target 'merge.txt' needs the coverage files
merge.txt: $(patsubst %.bed,%.coverage,$(bed.files))
	#loop over the bed file and create merge.txt
	echo "creating $@ from $(foreach BED,$^, bed file ${BED})" > $@

#cleanup
clean:
	rm -f merge.txt $(patsubst %.bed,%.coverage,$(bed.files))

Invoking make -j 4 should produce output like the following (because the jobs run in parallel, the order of the first four lines may vary):

$ make -j 4 

echo "do something with file1.bed and generate file1.coverage" > file1.coverage
echo "do something with file2.bed and generate file2.coverage" > file2.coverage
echo "do something with file3.bed and generate file3.coverage" > file3.coverage
echo "do something with file4.bed and generate file4.coverage" > file4.coverage
#loop over the bed file and create merge.txt
echo "creating merge.txt from  bed file file1.coverage  bed file file2.coverage  bed file file3.coverage  bed file file4.coverage" > merge.txt

score 2 · Answer 3 · 2013-01-16

11.9 years ago

Alex Reynolds 36k

As a general approach with a job scheduler, say Sun Grid Engine, there is qsub with the -hold_jid option to submit a task that waits for the successful completion of parent qsub-ed grid jobs.
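A rough sketch of how that could look. It only prints the qsub command lines unless `SUBMIT=1` is set on a host where Grid Engine is available. The job scripts `run_coverage.sh` and `merge_counts.sh`, the job names, and the input file names are all hypothetical.

```shell
# Sketch: submit one job per file, then a final job held on all of them.
# Dry run by default; set SUBMIT=1 to really call qsub.
submit() {
    if [ "${SUBMIT:-0}" = 1 ]; then "$@"; else echo "$@"; fi
}

hold_list=""
for bed in file1.bed file2.bed file3.bed; do
    name="cov_${bed%.bed}"                      # e.g. cov_file1
    hold_list="${hold_list:+$hold_list,}$name"  # comma-separated job names
    submit qsub -cwd -N "$name" run_coverage.sh "$bed"
done

# -hold_jid takes a comma-separated list of job names or job IDs;
# the merge job stays on hold until the listed jobs are done.
submit qsub -cwd -N merge_counts -hold_jid "$hold_list" merge_counts.sh
```

Using job names rather than numeric job IDs avoids having to parse the output of each qsub call, at the cost of needing names that are unique among your queued jobs.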