What Is The Best Way To Run Bedtools In Parallel With Blocking
11.9 years ago
Ying W ★ 4.3k

Say I am working on a server with a shared file system and four quad-core nodes (16 cores total; I/O is not an issue). I want to run coverageBed across 20 files. Currently I have a shell script that does this sequentially. I could simply background each command so that they run in parallel, but I am not sure how to block in Bash (the next step requires counting across the files). Assuming I/O is not a bottleneck, what are good ways to take advantage of multiple nodes/cores when running bedtools (or any other sequential command, for that matter)?

From my rudimentary understanding of parallel programming, the concept I am trying to get at is how to 'block' so that the next command after coverageBed is not executed until all coverageBed runs are done.

I was thinking of wrapping the shell commands in a Python script with a queue of coverageBed commands and a function that feeds commands four at a time (since the nodes are quad-core) and returns only when the queue is empty. Is there a better way of doing this?

bedtools parallel • 5.3k views

Guess you need the "wait" bash command?
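
A minimal sketch of the wait approach, in case it helps. The run_coverage function is a hypothetical stand-in for the real coverageBed call (it only counts lines), and the demo .bed files are created in a temporary directory so the snippet can be run as-is:

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the real coverageBed command; replace the body
# with your actual call. Assumption: one input file in, one .coverage file out.
run_coverage() {
    wc -l < "$1" > "$1.coverage"
}

# Demo input: four tiny BED files in a scratch directory.
workdir=$(mktemp -d)
for i in 1 2 3 4; do
    printf 'chr1\t0\t10\n' > "$workdir/file$i.bed"
done

for bed in "$workdir"/*.bed; do
    run_coverage "$bed" &   # start each job in the background
done
wait                        # block until every background job has exited

# This line is only reached once all .coverage files have been written.
cat "$workdir"/*.coverage > "$workdir/merge.txt"
```

Note that this starts every job at once. With 20 files and 16 cores that may be acceptable, but nothing here caps the number of concurrent jobs.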


thanks, that is exactly what I needed

11.9 years ago
Fred ▴ 790

You can use GNU parallel as follows:

ls *.bed | parallel -j 4 'bedtools ... {} > {}.coverage'

The {} placeholder syntax is close to that of 'find -exec'.
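
On the blocking part of the question: a pipeline like this returns only after all of its jobs have finished, so the next step can simply follow on the next line. The sketch below illustrates that with xargs -P, used here as a plainly swapped-in stand-in because it ships with most systems; it is not GNU parallel itself. The wc -l call is a hypothetical placeholder for the real bedtools command, and the demo files are made up:

```shell
#!/usr/bin/env bash
# Demo input: four tiny BED files in a scratch directory.
workdir=$(mktemp -d)
for i in 1 2 3 4; do
    printf 'chr1\t0\t10\n' > "$workdir/file$i.bed"
done

# Run at most 4 jobs at a time, one file per job.
# 'wc -l' is a placeholder for the real bedtools call (assumption).
printf '%s\0' "$workdir"/*.bed |
    xargs -0 -n 1 -P 4 sh -c 'wc -l < "$1" > "$1.coverage"' _

# xargs has returned, so every job is done and the merge step is safe.
cat "$workdir"/*.coverage > "$workdir/merge.txt"
```

The -P 4 here plays the same role as -j 4 above: it caps the number of simultaneous jobs.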

11.9 years ago

The tool you need is the venerable 'make' with the option '-j':

   -j [jobs], --jobs[=jobs]

        Specifies the number of jobs (commands) to run simultaneously.

Here is a Makefile as an example. The coverage files can be generated in parallel, but merge.txt will have to wait for them:

# NOTE: recipe lines must begin with a TAB character, not spaces.

# pattern rule: how to make a *.coverage from a *.bed
%.coverage: %.bed
	echo "do something with $< and generate $@" > $@

# your list of *.bed files
bed.files=file1.bed file2.bed file3.bed file4.bed

# special targets that don't really exist
.PHONY: all clean

# top target 'all' needs merge.txt
all: merge.txt

# target 'merge.txt' needs the coverage files
merge.txt: $(patsubst %.bed,%.coverage,$(bed.files))
	#loop over the bed file and create merge.txt
	echo "creating $@ from $(foreach BED,$^, bed file ${BED})" > $@

# cleanup
clean:
	rm -f merge.txt $(patsubst %.bed,%.coverage,$(bed.files))

Invoking make -j 4 should produce output like the following (the order of the four coverage lines may vary between runs):

$ make -j 4 

echo "do something with file1.bed and generate file1.coverage" > file1.coverage
echo "do something with file2.bed and generate file2.coverage" > file2.coverage
echo "do something with file3.bed and generate file3.coverage" > file3.coverage
echo "do something with file4.bed and generate file4.coverage" > file4.coverage
#loop over the bed file and create merge.txt
echo "creating merge.txt from  bed file file1.coverage  bed file file2.coverage  bed file file3.coverage  bed file file4.coverage" > merge.txt
11.9 years ago

As a general approach with a job scheduler such as Sun Grid Engine, qsub has the -hold_jid option, which lets you submit a task that waits until the parent qsub-ed grid jobs have completed.
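
A rough sketch of how that could look. The job scripts coverage.sh and merge.sh, the job names, and the file names are all hypothetical. The snippet runs in dry-run mode by default and only prints the qsub commands; setting SUBMIT=1 on a real SGE cluster would submit them:

```shell
#!/usr/bin/env bash
# Dry-run wrapper: print the qsub command unless SUBMIT=1 is set.
submit() {
    if [ "${SUBMIT:-0}" = 1 ]; then
        qsub "$@"
    else
        echo qsub "$@"
    fi
}

hold_list=""
for bed in file1.bed file2.bed file3.bed file4.bed; do
    name="cov_${bed%.bed}"                      # e.g. cov_file1
    # coverage.sh is a hypothetical script wrapping the coverageBed call.
    submit -cwd -N "$name" coverage.sh "$bed"
    hold_list="${hold_list:+$hold_list,}$name"  # build comma-separated names
done

# The merge job stays on hold until all the named coverage jobs have finished.
# merge.sh is a hypothetical script for the follow-up counting step.
submit -cwd -N merge -hold_jid "$hold_list" merge.sh
```

Naming the jobs with -N makes it possible to refer to them in -hold_jid without having to parse job IDs out of the qsub output.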


You can also use qmake instead of qsub. qmake works with a standard Makefile.
