Split loop into chunks and execute in parallel in shell scripting
10.1 years ago
Bioinfguy ▴ 30

Suppose I have the following script:

#!/bin/bash

for i in $(cat $1); do sample+=($i); done
for x in $(cat $2); do lenn+=($x); done

tLen=${#sample[@]}

for (( i=0; i<${tLen}; i++ ));
do
    echo "${sample[$i]}.1111 ${lenn[$i]}.1111"
    echo "${sample[$i]}.2222 ${lenn[$i]}.2222"
done

and I have 2 files

capital.txt and small.txt:

capital.txt

A
B
C
D

small.txt

a
b
c
d

The output will be:

A.1111 a.1111
A.2222 a.2222
B.1111 b.1111
B.2222 b.2222
C.1111 c.1111
C.2222 c.2222
D.1111 d.1111
D.2222 d.2222

Suppose that I want to parallelize the loop so that I get the output in parallel:

A.1111 a.1111
A.2222 a.2222
B.1111 b.1111
B.2222 b.2222

and

C.1111 c.1111
C.2222 c.2222
D.1111 d.1111
D.2222 d.2222

should be executed in parallel. Of course, printing something to the screen takes almost no time, but suppose that each command takes 10 minutes: run in series, the total time is 80 minutes; split into 2 halves, it takes 40 minutes; split into 4 quarters, it takes 20 minutes. How can I run these in parallel, given that the commands for each letter must run in series because the second depends on the first? That is,

A.1111 a.1111
A.2222 a.2222

should run in series.

Thanks


"that it is important to run each letter in series because it is dependent on the other" sounds like a job for a Makefile, with the option '-j'.

10.1 years ago

The simplest way to do this sort of task is with GNU parallel, a very powerful tool for launching and coordinating multiple independent tasks:

#!/bin/bash
parallel --xapply -a $1 -a $2 echo {1}.1111 {2}.1111
parallel --xapply -a $1 -a $2 echo {1}.2222 {2}.2222

This runs all of the first jobs in parallel, and then all of the second jobs in parallel. It guarantees your constraint is met, but it's a bit heavy-handed: it waits until _all_ of the first jobs are done before _any_ of the second jobs start:

$ ./parallel-script capital.txt small.txt
A.1111 a.1111
B.1111 b.1111
C.1111 c.1111
D.1111 d.1111
A.2222 a.2222
B.2222 b.2222
C.2222 c.2222
D.2222 d.2222

You could also have each processor run both dependent steps in order:

#!/bin/bash
parallel --xapply -a $1 -a $2 "echo {1}.1111 {2}.1111; echo {1}.2222 {2}.2222"

$ ./script capital.txt small.txt
A.1111 a.1111
A.2222 a.2222
B.1111 b.1111
B.2222 b.2222
C.1111 c.1111
C.2222 c.2222
D.1111 d.1111
D.2222 d.2222

The tutorial on biostars shows how this can be used to run across nodes, and how to set the number of processes to run on.

It's possible to do this without GNU parallel, of course, but it's instructive to see how much more complicated it is. make -j is of course an old standby, made a little more complicated to use here because we need to get two arguments into the Makefile. Here we write a script to build a Makefile (for which we'll require a fairly new GNU make):

#!/bin/bash

makefile="Makefile"

jobs=("1111" "2222")
items=$( paste -d_ $1 $2 )

njobs=${#jobs[@]}
let lastjob=${njobs}-1

echo -n "all: " > ${makefile}
for (( job=0; job<=${lastjob}; job++ ))
do
    for item in ${items}
    do
        echo -n "${item}.${jobs[$job]} " >> ${makefile}
    done
done
echo "" >> ${makefile}
echo "" >> ${makefile}

echo 'firstfile  = $(firstword $(subst _, , $1))$(strip $(2))' >> ${makefile}
echo 'secondfile = $(word 2,$(subst _, , $1))$(strip $(2))' >> ${makefile}
echo "" >> ${makefile}

for (( job=0; job<=${lastjob}; job++ ))
do
    let pjob=${job}-1
    if [ $job == 0 ]
    then
        echo "%.${jobs[$job]}:" >> ${makefile}
    else
        echo "%.${jobs[$job]}: %.${jobs[$pjob]}" >> ${makefile}
    fi
    # Recipe lines in a Makefile must begin with a tab character, hence $'\t'.
    echo $'\t''@echo $(call firstfile, $(basename $@), $(suffix $@)) \' >> ${makefile}
    echo $'\t\t''$(call secondfile, $(basename $@), $(suffix $@)) ' >> ${makefile}
    echo $'\t''touch $@' >> ${makefile}
    echo " " >> ${makefile}
done

Running it gives:

$ ./makemakefile capital.txt small.txt
$ cat Makefile
all: A_a.1111 B_b.1111 C_c.1111 D_d.1111 A_a.2222 B_b.2222 C_c.2222 D_d.2222

firstfile  = $(firstword $(subst _, , $1))$(strip $(2))
secondfile = $(word 2,$(subst _, , $1))$(strip $(2))

%.1111:
    @echo $(call firstfile, $(basename $@), $(suffix $@)) \
            $(call secondfile, $(basename $@), $(suffix $@))
    touch $@

%.2222: %.1111
    @echo $(call firstfile, $(basename $@), $(suffix $@)) \
            $(call secondfile, $(basename $@), $(suffix $@))
    touch $@

$ make -j 3
A.1111 a.1111
B.1111 b.1111
touch A_a.1111
touch B_b.1111
C.1111 c.1111
touch C_c.1111
D.1111 d.1111
touch D_d.1111
A.2222 a.2222
touch A_a.2222
B.2222 b.2222
touch B_b.2222
C.2222 c.2222
touch C_c.2222
D.2222 d.2222
touch D_d.2222

$ rm *_*

Note that here we've used dummy target files, created with touch, to record which steps are done; your workflow will likely produce real files that you can use as dependencies instead.

Finally, you can even just launch multiple processes on the same machine with ampersand, and wait for them to complete. You might need to stick another wait in there to make sure the jobs you need are complete:

$ cat ./makerunscript

#!/bin/bash

NPROCS=3
jobscript="jobscript.sh"

echo "#!/bin/bash" > $jobscript

let count=0
for job in "1111" "2222"
do
    for item in $( paste -d_ $1 $2 )
    do
        left=$( echo $item | sed -e 's/^\([^_]*\)_.*/\1/' )
        right=$( echo $item | sed -e 's/[^_]*_\(.*\)/\1/' )
        echo "echo ${left}.${job} ${right}.${job} &" >> $jobscript
        let count=count+1
        if [ $(( count % NPROCS )) == 0 ]
        then
            echo "wait" >> $jobscript
        fi
    done
    echo "wait # make sure all earlier jobs done" >> $jobscript
done
echo "wait # make sure all jobs done" >> $jobscript

$ ./makerunscript capital.txt small.txt
$ cat jobscript.sh
#!/bin/bash
echo A.1111 a.1111 &
echo B.1111 b.1111 &
echo C.1111 c.1111 &
wait
echo D.1111 d.1111 &
wait # make sure all earlier jobs done
echo A.2222 a.2222 &
echo B.2222 b.2222 &
wait
echo C.2222 c.2222 &
echo D.2222 d.2222 &
wait # make sure all earlier jobs done
wait # make sure all jobs done

$ source jobscript.sh
A.1111 a.1111
[1]   Done                    echo A.1111 a.1111
B.1111 b.1111
[2]-  Done                    echo B.1111 b.1111
C.1111 c.1111
[3]+  Done                    echo C.1111 c.1111
D.1111 d.1111
[1]+  Done                    echo D.1111 d.1111
A.2222 a.2222
[1]-  Done                    echo A.2222 a.2222
B.2222 b.2222
[2]+  Done                    echo B.2222 b.2222
C.2222 c.2222
D.2222 d.2222
[1]-  Done                    echo C.2222 c.2222
[2]+  Done                    echo D.2222 d.2222
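
The same ampersand-and-wait idea can also be written directly, without generating a job script, and with each pair's two steps kept in series inside one background job. This is only a minimal sketch: run_pair, run_all and NPROCS are illustrative names, the echo lines stand in for the real long-running commands, and it assumes the items contain no whitespace.

```shell
#!/bin/bash
# Sketch: pairs run in parallel, the two steps of each pair run in series.
# run_pair, run_all and NPROCS are hypothetical names chosen for illustration.

NPROCS=${NPROCS:-2}

run_pair() {
    # Stand-ins for the real commands; step 2 starts only after step 1 returns.
    echo "$1.1111 $2.1111"
    echo "$1.2222 $2.2222"
}

run_all() {
    # $1 and $2 are the two input files, read side by side, one pair per line.
    paste -d' ' "$1" "$2" | {
        count=0
        while read -r left right; do
            run_pair "$left" "$right" &
            count=$(( count + 1 ))
            # After every NPROCS launches, wait for the whole batch to finish.
            if [ $(( count % NPROCS )) -eq 0 ]; then
                wait
            fi
        done
        wait   # catch a final, partially filled batch
    }
}
```

Calling `run_all capital.txt small.txt` prints the same eight lines as the serial loop, although lines from different pairs may interleave. Each batch waits for its slowest member; on bash 4.3 or later, `wait -n` could be used instead to refill a slot as soon as any one job finishes.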

So it's certainly possible to do with tools other than gnu-parallel, but it sure is a lot easier to just make sure gnu-parallel is installed.
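
Where installing it is not an option, `xargs -P` can cover a simple case like this one. The following is a rough sketch rather than a drop-in replacement: run_with_xargs is a made-up name, `-P` is a GNU/BSD extension rather than POSIX, and the items are assumed to contain no whitespace or quotes.

```shell
#!/bin/bash
# Sketch: xargs -P as a fallback when GNU parallel is unavailable.
# run_with_xargs is a hypothetical helper name.

run_with_xargs() {
    # $1, $2: the two input files; $3: number of parallel workers (default 2).
    # -n 2 hands one pair (two words) to each sh invocation as $1 and $2.
    paste -d' ' "$1" "$2" |
        xargs -n 2 -P "${3:-2}" sh -c '
            echo "$1.1111 $2.1111"
            echo "$1.2222 $2.2222"
        ' _
}
```

For example, `run_with_xargs capital.txt small.txt 4` runs up to four pairs at once. The two steps of a pair always run in order because they sit in the same `sh -c` script, but nothing here protects against output from different pairs being mixed.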


The Makefile is too simplified: Bioinfguy prints to stdout, so you also need to make sure the output is not mixed (e.g. half a line from one process, other half from another: http://www.gnu.org/software/parallel/man.html#DIFFERENCES-BETWEEN-xargs-AND-GNU-Parallel)
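
One common way to keep output from mixing in a plain shell version is to give every job its own output file and print the files in order once all jobs have finished. A minimal sketch, assuming items without whitespace; work and run_buffered are hypothetical names, and for brevity it starts all jobs at once instead of limiting how many run together:

```shell
#!/bin/bash
# Sketch: buffer each job's output in its own file, then print in launch order.
# work and run_buffered are hypothetical names chosen for illustration.

work() {
    # Stand-ins for the real commands of one pair, run in series.
    echo "$1.1111 $2.1111"
    echo "$1.2222 $2.2222"
}

run_buffered() {
    outdir=$(mktemp -d) || return 1
    paste -d' ' "$1" "$2" | {
        n=0
        while read -r left right; do
            n=$(( n + 1 ))
            # Zero-padded file names so the glob below sorts in launch order.
            work "$left" "$right" > "$outdir/$(printf '%06d' "$n").out" &
        done
        wait   # all jobs must finish before anything is printed
    }
    cat "$outdir"/*.out
    rm -rf "$outdir"
}
```

With this, `run_buffered capital.txt small.txt` prints each pair's lines together and in input order, at the cost of seeing no output until every job is done.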


There might be all sorts of additional complications (or maybe even simplifications) depending on what the OP's real workload is, yes; presumably it's a lot more complicated than "echo" :)

10.1 years ago

@Jonathan: I don't see why creating the Makefile needs to be so complicated, unless I'm missing something. The following Makefile:

pairs=$(shell  paste -d_ jeter1.txt jeter2.txt )

define method1

$(addsuffix .111,$(1)) :
    touch $$@

$(addsuffix .222,$(1)) :
    touch $$@

endef

define method2

$(foreach S, $(subst _, ,$(1)) ,$(eval $(call method1, $(S) )))

$(addsuffix .111,$(1)) : $(addsuffix .111,$(subst _, ,$(1)))
    touch $$@

$(addsuffix .222,$(1)) :  $(addsuffix .111,$(1)) $(addsuffix .222,$(subst _, ,$(1)))
    touch $$@

endef

all: $(addsuffix .222,$(pairs))

$(foreach P, $(pairs),$(eval $(call method2, $(P) )))

produces the desired output:

~$ make -n

touch A.111
touch a.111
touch A_a.111
touch A.222
touch a.222
touch A_a.222
touch B.111
touch b.111
touch B_b.111
touch B.222
touch b.222
touch B_b.222
touch C.111
touch c.111
touch C_c.111
touch C.222
touch c.222
touch C_c.222
touch D.111
touch d.111
touch D_d.111
touch D.222
touch d.222
touch D_d.222

Ah - reading in the files and iterating over them directly in the Makefile is much better - it saves a step and makes things substantially simpler. But you will still need to break the items (eg, A_a) apart into their constituents so that the OP can run their intended "[command] A.111 a.111" - for which you could use the firstfile/secondfile routines above. Alternately you could loop directly over the pairs of corresponding items, which is likely possible, I just don't see how to do it right now.

I guess you might also have to foreach over 111/222 etc - I was assuming that the real workload might involve an arbitrary number of steps, but that's probably more general than needed.


updated my Makefile
