Job Manager to parallelize otherwise consecutive bash scripts..?
4
5
Entering edit mode
8.8 years ago
John 13k

Hello all :)

I often find myself having shell scripts - or rather, a list of commands to run via bash - which could be run concurrently if only there was a way to do that easily.

For example, here is a snippet one such file:

java -jar /ex/picard/picard.jar FastqToSam \
    DESCRIPTION="Circadian Day" \
    FASTQ="/sequence/44_Mm01_WEAd_C2_H3K27ac_F_1_CAGATC_L003_R1_001.fastq.gz" \
    FASTQ2="/sequence/44_Mm01_WEAd_C2_H3K27ac_F_1_CAGATC_L003_R2_001.fastq.gz" \
    OUTPUT="/sequence/44_Mm01_WEAd_C2_H3K27ac_F_1_CAGATC_L003.bam" \
    SAMPLE_NAME="C57BL/6J" \
    LIBRARY_NAME="Mm01.H3K27ac" \
    READ_GROUP_NAME="39V34V1.D265MACXX.3.CAGATC" \
    SEQUENCING_CENTER="Freiburg" \
    PLATFORM="illumina" \

java -jar /ex/picard/picard.jar FastqToSam \
    DESCRIPTION="Circadian Day" \
    FASTQ="/sequence/44_Mm01_WEAd_C2_H3K27ac_F_1_CAGATC_L004_R1_001.fastq.gz" \
    FASTQ2="/sequence/44_Mm01_WEAd_C2_H3K27ac_F_1_CAGATC_L004_R2_001.fastq.gz" \
    OUTPUT="/sequence/44_Mm01_WEAd_C2_H3K27ac_F_1_CAGATC_L004.bam" \
    SAMPLE_NAME="C57BL/6J" \
    LIBRARY_NAME="Mm01.H3K27ac" \
    READ_GROUP_NAME="39V34V1.D265MACXX.4.CAGATC" \
    SEQUENCING_CENTER="Freiburg" \
    PLATFORM="illumina" \

java -jar /ex/picard/picard.jar FastqToSam \
    DESCRIPTION="Circadian Day" \
    FASTQ="/sequence/44_Mm01_WEAd_C2_H3K27ac_F_1_CAGATC_L003_R1_001.fastq.gz" \
    FASTQ2="/sequence/44_Mm01_WEAd_C2_H3K27ac_F_1_CAGATC_L003_R2_001.fastq.gz" \
    OUTPUT="/sequence/44_Mm01_WEAd_C2_H3K27ac_F_1_CAGATC_L003.bam" \
    SAMPLE_NAME="C57BL/6J" \
    LIBRARY_NAME="Mm01.H3K27ac" \
    READ_GROUP_NAME="39V34V1.D2GD2ACXX.3.CAGATC" \
    SEQUENCING_CENTER="Freiburg" \
    PLATFORM="illumina" \

java -jar /ex/picard/picard.jar FastqToSam \
    DESCRIPTION="Circadian Day" \
    FASTQ="/sequence/44_Mm01_WEAd_C2_H3K27me3_F_1_CGATGT_L003_R1_001.fastq.gz" \
    FASTQ2="/sequence/44_Mm01_WEAd_C2_H3K27me3_F_1_CGATGT_L003_R2_001.fastq.gz" \
    OUTPUT="/sequence/44_Mm01_WEAd_C2_H3K27me3_F_1_CGATGT_L003.bam" \
    SAMPLE_NAME="C57BL/6J" \
    LIBRARY_NAME="Mm01.H3K27me3" \
    READ_GROUP_NAME="39V34V1.D265MACXX.3.CGATGT" \
    SEQUENCING_CENTER="Freiburg" \
    PLATFORM="illumina" \
..

Lets say there were 1000 of such commands in a file called jobs.sh, and I could run it with bash jobs.sh to have each one done one-after-the-other. However, since each job clearly does not rely on one another being completed, it might be nice to run them in parallel, say 5 at a time.

Bash natively has no way to do that without some really complicated logic wrapping each command, deciding when and when not to background a command and proceed to the next.

The Unix tool parallel can do it with something like:

parallel --jobs 5 sh -c {} < jobs.sh

However this will only work if the commands in jobs.sh are one-per-line. Multiline commands like the ones above break it. Furthermore, if the commands behave differently when their stdout/stderr are being redirected, they will detect that parallel is redirecting and, well, act differently - no status output, refuse to run, etc. You can also do this with native Unix xargs (https://www.gnu.org/software/parallel/man.html#DIFFERENCES-BETWEEN-xargs-AND-GNU-Parallel) but its fiddlier and has all the same issues as parallel.

So my question is two-fold.

  1. Does anyone know of a tool to parallelize a list of commands that would otherwise run consecutively in bash with little extra effort. Ideally each command gets run into it's own screen session so we can drop in at any time to see how its getting along (and no stdout/err redirection issues), and also spin up/down more worker processes as we go along, etc.
  2. How do you manage jobs currently? Perhaps you never have to run 100 nearly-identical commands. Perhaps your commands are generated by something else (like Galaxy) which manages jobs for you also. Perhaps you, like I was doing last week, open up 32 new terminal windows to run 1/32th of the commands of jobs.sh in each :P Or perhaps something much more sophisticated.... hopefully :)
parallel xargs bash • 5.1k views
ADD COMMENT
8
Entering edit mode
8.8 years ago

You're looking for GNU make and the -j option. Some clusters use a custom version of make (e.g., 'qmake' for SGE )

Search biostars for some other solutions (snakemake, ... ): e.g: Workflow management software for pipeline development in NGS,

An example of Makefile:

samtools.dir=../samtools/
samtools.exe=${samtools.dir}samtools
wgsim.exe=${samtools.dir}misc/wgsim
bwa.exe=../bwa/bwa
picard.jar=../picard-tools-1.138/picard.jar
bcftools.exe=../bcftools/bcftools

REF=ref.fa

mutations.vcf.gz : ${REF} \
    $(addsuffix .bam.bai ,S1) \
    $(addsuffix .bam.bai ,S2) \
    $(addsuffix .bam.bai ,S3) \
    $(addsuffix .bam.bai ,S4)
    ${samtools.exe} mpileup  -u -f $< $(filter %.bam,$(basename $^)) |\
    ${bcftools.exe} call -c -v -O z -o $@ -


$(addsuffix .bam.bai ,S1):  $(addsuffix .bam,S1)
    ${samtools.exe} index $<

$(addsuffix .bam,S1):  \
        $(addsuffix .bam,S1_01_R1.fq.gz)  \
        $(addsuffix .bam,S1_02_R1.fq.gz)  \
        $(addsuffix .bam,S1_03_R1.fq.gz)
    java -jar ${picard.jar} MergeSamFiles AS=true O=$@ $(foreach B,$^, I=${B} )


$(addsuffix .bam,S1_01_R1.fq.gz): $(addsuffix .bwt,${REF}) S1_01_R1.fq.gz S1_01_R2.fq.gz
    ${bwa.exe} mem  -R '@RG\tID:S1\tSM:S1' ${REF} $(filter %.gz,$^) |\
    ${samtools.exe} view -Sbu  - |\
    ${samtools.exe} sort -O sam -o $@ -T $(basename $@) -

S1_01_R2.fq.gz : S1_01_R1.fq.gz
    gzip --best -f $(basename $@)

S1_01_R1.fq.gz  : ${REF}
    @sleep 1
    ${wgsim.exe} -r 0.01 -S 1 -N 1000  $< $(basename S1_01_R1.fq.gz S1_01_R2.fq.gz)
    gzip --best -f $(basename $@)


$(addsuffix .bam,S1_02_R1.fq.gz): $(addsuffix .bwt,${REF}) S1_02_R1.fq.gz S1_02_R2.fq.gz
    ${bwa.exe} mem  -R '@RG\tID:S1\tSM:S1' ${REF} $(filter %.gz,$^) |\
    ${samtools.exe} view -Sbu  - |\
    ${samtools.exe} sort -O sam -o $@ -T $(basename $@) -

S1_02_R2.fq.gz : S1_02_R1.fq.gz
    gzip --best -f $(basename $@)

S1_02_R1.fq.gz  : ${REF}
    @sleep 1
    ${wgsim.exe} -r 0.01 -S 1 -N 1000  $< $(basename S1_02_R1.fq.gz S1_02_R2.fq.gz)
    gzip --best -f $(basename $@)


$(addsuffix .bam,S1_03_R1.fq.gz): $(addsuffix .bwt,${REF}) S1_03_R1.fq.gz S1_03_R2.fq.gz
    ${bwa.exe} mem  -R '@RG\tID:S1\tSM:S1' ${REF} $(filter %.gz,$^) |\
    ${samtools.exe} view -Sbu  - |\
    ${samtools.exe} sort -O sam -o $@ -T $(basename $@) -

S1_03_R2.fq.gz : S1_03_R1.fq.gz
    gzip --best -f $(basename $@)

S1_03_R1.fq.gz  : ${REF}
    @sleep 1
    ${wgsim.exe} -r 0.01 -S 1 -N 1000  $< $(basename S1_03_R1.fq.gz S1_03_R2.fq.gz)
    gzip --best -f $(basename $@)


$(addsuffix .bam.bai ,S2):  $(addsuffix .bam,S2)
    ${samtools.exe} index $<

$(addsuffix .bam,S2):  \
        $(addsuffix .bam,S2_01_R1.fq.gz)  \
        $(addsuffix .bam,S2_02_R1.fq.gz)  \
        $(addsuffix .bam,S2_03_R1.fq.gz)
    java -jar ${picard.jar} MergeSamFiles AS=true O=$@ $(foreach B,$^, I=${B} )


$(addsuffix .bam,S2_01_R1.fq.gz): $(addsuffix .bwt,${REF}) S2_01_R1.fq.gz S2_01_R2.fq.gz
    ${bwa.exe} mem  -R '@RG\tID:S2\tSM:S2' ${REF} $(filter %.gz,$^) |\
    ${samtools.exe} view -Sbu  - |\
    ${samtools.exe} sort -O sam -o $@ -T $(basename $@) -

S2_01_R2.fq.gz : S2_01_R1.fq.gz
    gzip --best -f $(basename $@)

S2_01_R1.fq.gz  : ${REF}
    @sleep 1
    ${wgsim.exe} -r 0.01 -S 2 -N 1000  $< $(basename S2_01_R1.fq.gz S2_01_R2.fq.gz)
    gzip --best -f $(basename $@)


$(addsuffix .bam,S2_02_R1.fq.gz): $(addsuffix .bwt,${REF}) S2_02_R1.fq.gz S2_02_R2.fq.gz
    ${bwa.exe} mem  -R '@RG\tID:S2\tSM:S2' ${REF} $(filter %.gz,$^) |\
    ${samtools.exe} view -Sbu  - |\
    ${samtools.exe} sort -O sam -o $@ -T $(basename $@) -

S2_02_R2.fq.gz : S2_02_R1.fq.gz
    gzip --best -f $(basename $@)

S2_02_R1.fq.gz  : ${REF}
    @sleep 1
    ${wgsim.exe} -r 0.01 -S 2 -N 1000  $< $(basename S2_02_R1.fq.gz S2_02_R2.fq.gz)
    gzip --best -f $(basename $@)


$(addsuffix .bam,S2_03_R1.fq.gz): $(addsuffix .bwt,${REF}) S2_03_R1.fq.gz S2_03_R2.fq.gz
    ${bwa.exe} mem  -R '@RG\tID:S2\tSM:S2' ${REF} $(filter %.gz,$^) |\
    ${samtools.exe} view -Sbu  - |\
    ${samtools.exe} sort -O sam -o $@ -T $(basename $@) -

S2_03_R2.fq.gz : S2_03_R1.fq.gz
    gzip --best -f $(basename $@)

S2_03_R1.fq.gz  : ${REF}
    @sleep 1
    ${wgsim.exe} -r 0.01 -S 2 -N 1000  $< $(basename S2_03_R1.fq.gz S2_03_R2.fq.gz)
    gzip --best -f $(basename $@)


$(addsuffix .bam.bai ,S3):  $(addsuffix .bam,S3)
    ${samtools.exe} index $<

$(addsuffix .bam,S3):  \
        $(addsuffix .bam,S3_01_R1.fq.gz)  \
        $(addsuffix .bam,S3_02_R1.fq.gz)  \
        $(addsuffix .bam,S3_03_R1.fq.gz)  \
        $(addsuffix .bam,S3_04_R1.fq.gz)
    java -jar ${picard.jar} MergeSamFiles AS=true O=$@ $(foreach B,$^, I=${B} )


$(addsuffix .bam,S3_01_R1.fq.gz): $(addsuffix .bwt,${REF}) S3_01_R1.fq.gz S3_01_R2.fq.gz
    ${bwa.exe} mem  -R '@RG\tID:S3\tSM:S3' ${REF} $(filter %.gz,$^) |\
    ${samtools.exe} view -Sbu  - |\
    ${samtools.exe} sort -O sam -o $@ -T $(basename $@) -

S3_01_R2.fq.gz : S3_01_R1.fq.gz
    gzip --best -f $(basename $@)

S3_01_R1.fq.gz  : ${REF}
    @sleep 1
    ${wgsim.exe} -r 0.01 -S 3 -N 1000  $< $(basename S3_01_R1.fq.gz S3_01_R2.fq.gz)
    gzip --best -f $(basename $@)


$(addsuffix .bam,S3_02_R1.fq.gz): $(addsuffix .bwt,${REF}) S3_02_R1.fq.gz S3_02_R2.fq.gz
    ${bwa.exe} mem  -R '@RG\tID:S3\tSM:S3' ${REF} $(filter %.gz,$^) |\
    ${samtools.exe} view -Sbu  - |\
    ${samtools.exe} sort -O sam -o $@ -T $(basename $@) -

S3_02_R2.fq.gz : S3_02_R1.fq.gz
    gzip --best -f $(basename $@)

S3_02_R1.fq.gz  : ${REF}
    @sleep 1
    ${wgsim.exe} -r 0.01 -S 3 -N 1000  $< $(basename S3_02_R1.fq.gz S3_02_R2.fq.gz)
    gzip --best -f $(basename $@)


$(addsuffix .bam,S3_03_R1.fq.gz): $(addsuffix .bwt,${REF}) S3_03_R1.fq.gz S3_03_R2.fq.gz
    ${bwa.exe} mem  -R '@RG\tID:S3\tSM:S3' ${REF} $(filter %.gz,$^) |\
    ${samtools.exe} view -Sbu  - |\
    ${samtools.exe} sort -O sam -o $@ -T $(basename $@) -

S3_03_R2.fq.gz : S3_03_R1.fq.gz
    gzip --best -f $(basename $@)

S3_03_R1.fq.gz  : ${REF}
    @sleep 1
    ${wgsim.exe} -r 0.01 -S 3 -N 1000  $< $(basename S3_03_R1.fq.gz S3_03_R2.fq.gz)
    gzip --best -f $(basename $@)


$(addsuffix .bam,S3_04_R1.fq.gz): $(addsuffix .bwt,${REF}) S3_04_R1.fq.gz S3_04_R2.fq.gz
    ${bwa.exe} mem  -R '@RG\tID:S3\tSM:S3' ${REF} $(filter %.gz,$^) |\
    ${samtools.exe} view -Sbu  - |\
    ${samtools.exe} sort -O sam -o $@ -T $(basename $@) -

S3_04_R2.fq.gz : S3_04_R1.fq.gz
    gzip --best -f $(basename $@)

S3_04_R1.fq.gz  : ${REF}
    @sleep 1
    ${wgsim.exe} -r 0.01 -S 3 -N 1000  $< $(basename S3_04_R1.fq.gz S3_04_R2.fq.gz)
    gzip --best -f $(basename $@)


$(addsuffix .bam.bai ,S4):  $(addsuffix .bam,S4)
    ${samtools.exe} index $<

$(addsuffix .bam,S4):  \
        $(addsuffix .bam,S4_01_R1.fq.gz)
    java -jar ${picard.jar} MergeSamFiles AS=true O=$@ $(foreach B,$^, I=${B} )


$(addsuffix .bam,S4_01_R1.fq.gz): $(addsuffix .bwt,${REF}) S4_01_R1.fq.gz S4_01_R2.fq.gz
    ${bwa.exe} mem  -R '@RG\tID:S4\tSM:S4' ${REF} $(filter %.gz,$^) |\
    ${samtools.exe} view -Sbu  - |\
    ${samtools.exe} sort -O sam -o $@ -T $(basename $@) -

S4_01_R2.fq.gz : S4_01_R1.fq.gz
    gzip --best -f $(basename $@)

S4_01_R1.fq.gz  : ${REF}
    @sleep 1
    ${wgsim.exe} -r 0.01 -S 4 -N 1000  $< $(basename S4_01_R1.fq.gz S4_01_R2.fq.gz)
    gzip --best -f $(basename $@)


$(addsuffix .bwt,${REF}) : ${REF}
    ${bwa.exe} index $<
${REF}:
    echo ">rotavirus" > $@
    echo "GGCTTTTAATGCTTTTCAGTGGTTGCTGCTCAAGATGGAGTCTACTCAGCAGATGGTAAGCTCTATTATT" >> $@
    echo "AATACTTCTTTTGAAGCTGCAGTTGTTGCTGCTACTTCAACATTAGAATTAATGGGTATTCAATATGATT" >> $@
    echo "ACAATGAAGTATTTACCAGAGTTAAAAGTAAATTTGATTATGTGATGGATGACTCTGGTGTTAAAAACAA" >> $@
    echo "TCTTTTGGGTAAAGCTATAACTATTGATCAGGCGTTAAATGGAAAGTTTAGCTCAGCTATTAGAAATAGA" >> $@
    echo "AATTGGATGACTGATTCTAAAACGGTTGCTAAATTAGATGAAGACGTGAATAAACTTAGAATGACTTTAT" >> $@
    echo "CTTCTAAAGGGATCGACCAAAAGATGAGAGTACTTAATGCTTGTTTTAGTGTAAAAAGAATACCAGGAAA" >> $@
    echo "ATCATCATCAATAATTAAATGCACTAGACTTATGAAGGATAAAATAGAACGTGGAGAAGTTGAGGTTGAT" >> $@
    echo "GATTCATATGTTGATGAGAAAATGGAAATTGATACTATTGATTGGAAATCTCGTTATGATCAGTTAGAAA" >> $@
    echo "AAAGATTTGAATCACTAAAACAGAGGGTTAATGAGAAATACAATACTTGGGTACAAAAAGCGAAGAAAGT" >> $@
    echo "AAATGAAAATATGTACTCTCTTCAGAATGTTATCTCACAACAGCAAAACCAAATAGCAGATCTTCAACAA" >> $@
    echo "TATTGTAGTAAATTGGAAGCTGATTTGCAAGGTAAATTTAGTTCATTAGTGTCATCAGTTGAGTGGTATC" >> $@
    echo "TAAGGTCTATGGAATTACCAGATGATGTAAAGAATGACATTGAACAGCAGTTAAATTCAATTGATTTAAT" >> $@
    echo "TAATCCCATTAATGCTATAGATGATATCGAATCGCTGATTAGAAATTTAATTCAAGATTATGACAGAACA" >> $@
    echo "TTTTTAATGTTAAAAGGACTGTTGAAGCAATGCAACTATGAATATGCATATGAGTAGTCATATAATTAAA" >> $@
    echo "AATATTAACCATCTACACATGACCCTCTATGAGCACAATAGTTAAAAGCTAACACTGTCAAAAACCTAAA" >> $@
    echo "TGGCTATAGGGGCGGTTTGTGACC" >> $@
    echo "" >> $@


graph.png:
    make -ndrB -f Makefile | ../makefile2graph/make2graph | dot -Tpng -o$@
ADD COMMENT
0
Entering edit mode

Wow, that looks like a really complete and sophisticated way of doing it - you have the ability here to do much better flow-control than with parallel - allow for some jobs to be dependent on others being done, etc. awesome :)

It is a little complex for my needs however (parallelizing an existing list of commands), plus I don't know the make syntax at all, nor xml all that well. I think it's definitely worth an investment of my time for a better/more productive future... perhaps this weekend I will try and wrap my head around it :) But in the mean time I was hoping for something really simple just to get the jobs done, hehe :) (I know, I know - I have some sunk-cost-fallacy issues about getting these bash scripts running in parallel instead of doing it the proper way ;) )

ADD REPLY
1
Entering edit mode

back: the makefile was generated from a model, but you can write it from scratch.

ADD REPLY
0
Entering edit mode

no, I was wrong to put a XSL+XML example. I'll remove it. back in 5'

ADD REPLY
4
Entering edit mode
8.8 years ago
ole.tange ★ 4.5k

However this will only work if the commands in jobs.sh are one-per-line. Multiline commands like the ones above break it.

That can be mitigated simply by using -d '\n\n'. That way you will split on two newlines instead of one.

Furthermore, if the commands behave differently when their stdout/stderr are being redirected, they will detect that parallel is redirecting and, well, act differently - no status output, refuse to run, etc.

If you do not like redirection -u is for you. But your output may be mixed up.

It seems your commands are extremely similar. I would personally use a template for that - maybe a bash function:

toSam() {
java -jar /ex/picard/picard.jar FastqToSam \
    DESCRIPTION="$1" \
    FASTQ="$2" \
    FASTQ2="$3" \
    OUTPUT="$4" \
    SAMPLE_NAME="$5" \
    LIBRARY_NAME="$6" \
    READ_GROUP_NAME="$7" \
    SEQUENCING_CENTER="Freiburg" \
    PLATFORM="illumina"
}
export -f toSam

parallel --header : toSam {Description} {Fastq} {=2 s/_R1_/_R2_/=} {=2 's/(_L\d\d\d)_.*/$1/' =} {Samplename} {Library} {Readgroup} ::: samplesheet.csv

You need a samplesheet.csv like this:

Description,Fastq,Samplename,Library,Readgroup
Circadian Day,/sequence/44_Mm01_WEAd_C2_H3K27ac_F_1_CAGATC_L003_R1_001.fastq.gz,C57BL/6J,Mm01.H3K27ac,39V34V1.D265MACXX.3.CAGATC

Before going all in on GNU Parallel use --dryrun to see what will be run. Try a couple of these commands by hand and see that they work as expected.

ADD COMMENT
0
Entering edit mode

Wow Ole the -d and -u flags are exactly what I needed! That's totally solved my parallel issues :D

I will still read up on make this weekend, but for now this really simplifies things for me.

As I mentioned in some of the above comments, personally I like to generate my commands from inside python, where 90% of my code is checks rather than command-creation itself. For example, samplesheet.csv is itself going to have to be generated from the data right? (I'm certainly not going to fill out a row for all 1000 sample files by hand :P Not without making a few hundred typos.)

Which is why using parallel for script generation seems so funny to me - we use generated data to generate commands, when we could have just used whatever generated the data to just generate the commands directly :)

ADD REPLY
1
Entering edit mode

For example, sampelsheet.csv is itself going to have to be generated from the data right?

What does your original data look like? Maybe that can be used directly instead?

we use generated data to generate commands, when we could have just used whatever generated the data to just generate the commands directly

GNU Parallel tries to be powerful enough to simply take your original data and do the command generation. In your case it may be so complex that your solution is simpler - in which case: Go for it.

ADD REPLY
2
Entering edit mode
8.8 years ago

I also think that Pierre's answer is great but a little too sophisticated (unless I'm overlooking something of course)! Here's a simpler way to run the same command(s) for different input/output files.

For example, loop through all the fastq file pairs in current dir and prepare a bunch of bash scripts, but don't run them yet:

for FQ1 in *_R1_001.fastq.gz
do
    FQ2=${FQ1/_R1_/_R2_} ## Get fastq mate 2
    OUTBAM=`basename $FQ1 _R1_001.fastq.gz`.bam ## Prepare output filename
    echo "
        java -jar /ex/picard/picard.jar FastqToSam \
        DESCRIPTION="Circadian Day" \
        FASTQ=${FQ1} \
        FASTQ2=${FQ2} \
        OUTPUT=/sequence/${OUTBAM} \
        SAMPLE_NAME=C57BL/6J \
        LIBRARY_NAME=Mm01.H3K27ac \
        READ_GROUP_NAME=39V34V1.D265MACXX.3.CAGATC \
        SEQUENCING_CENTER=Freiburg \
        PLATFORM=illumina" > ${FQ1}.tmp.sh
done

You could add some error checking in the scripts, for example to check both fastq 1 and 2 exists.

Now run the bash scripts in parallel using xargs, run up to 10 jobs in parallel:

ls *.tmp.sh | xargs -n 1 -P 10 bash

This is just the get the idea of course

ADD COMMENT
2
Entering edit mode

The makefile would be:

SHELL=/bin/bash
FQ1=$(shell find -type -name "*_R1_001.fastq.gz"  )
## remove fastq and gz suffixes
NAMES=$(basename $(basename ${FQ1}))

define fastq2sam

$$(addsuffix .sam,$(1)):$$(addsuffix .fastq.gz, $(1) $$(subst _R1_,_R2_,$(1)))
    java -jar /ex/picard/picard.jar FastqToSam \
        FASTQ=$$(word 1,$$^) \
        FASTQ2=$$(word 2,$$^) \
        OUTPUT=$$@ \
        SAMPLE_NAME=$(1) \
        LIBRARY_NAME=$(1) \
        READ_GROUP_NAME=$(1) \
        SEQUENCING_CENTER=Freiburg \
        PLATFORM=illumina

endef

all: $(addsuffix .sam,${NAMES})

$(eval $(foreach S,${NAMES},$(call fastq2sam,${S})))
ADD REPLY
1
Entering edit mode

For example, loop through all the fastq file pairs in current dir and prepare a bunch of bash scripts, but don't run them yet:

You could add some error checking in the scripts, for example to check both fastq 1 and 2 exists.

you are re-inventing make.

ADD REPLY
1
Entering edit mode

you are re-inventing make.

Sure, but this way I don't have to learn and use a different thing (P.S. Learning make is on my todo list).

ADD REPLY
0
Entering edit mode

I agree that make is probably too sophisticated 99% of the time. But then again I would have said the same about C and there are times when that 1% really matters and makes things possible that otherwise were not.

Personally, I like the idea of generating commands in a programming language of your choice. I think it would take me a year to be as good at make as I already am in python. For example, here is the python script that actually generates the command file snippet in the OP: http://paste.ofcode.org/MY7ccbgPz2x2zBAYeDw4Ke (or if that has expired, log.bio "KGM0dr")

I couldn't do that in make. Or parallel. I'm not saying no-one can. I'm just saying I cannot. For that reason, I like the idea of people making there commands-to-be-run in the language of their choice, under some sort of psudeo-syntax that is really really easy to learn, then it just runs anywhere that understands that syntax. make ticks all those boxes - it's just not known by many people :(

I also am executing these files with this python, which is simulating parallel but with multi-line support: http://paste.ofcode.org/hSHma3ZZ6ttXFS4L6NqVjL

It's just buggy as hell, and I can think of 101 ways to improve it - but if I did I would be reinventing wheels on wheels :P

ADD REPLY
2
Entering edit mode
8.8 years ago
pld 5.1k

What about wrapping the commands into functions and then calling those functions with GNU Parallel? I usually use this whenever I have large/complex commands. The added benefit is that it makes managing the workflow much easier since you can incorporate several steps into a function and run them in parallel, without having to call parallel on every single step.

Something like:

samtools_flagstat(){
    to_check=$1
    out_file=$2
    samtools flagstat $to_check > $out_file
}
export -f samtools_flagstat
parallel -a list-of-files.txt samtools_flagstat {} {.}.flagstat

Alternatively, if you can run all commands in a given file at once, just put "&" at the end of every command in your .sh file.

ADD COMMENT
0
Entering edit mode

Hm, this is a really neat trick and it could work very nicely - but also maybe not. It depends if there is another way to encapsulate multi-line commands other than functions...

The problem I have with parallel is that it is often used to generate commands, rather than running a set of predefined commands. This may be good for some use-cases like in your nice and succinct example above - but if you can do some programming then you know how to generate a list of commands using your favourite language anyway, and you can use much more complicated checks and generators than parallel can give you (for example, you could calculate the mean insert size for some PE data in Python, then use that as a parameter for the to-be-run command). I also like keeping my command generation and command execution separate, as having a record for the commands which are to-be-run makes it a lot easier to keep logs, debug problems, re-run after a sudden abort, etc.

So I guess in an ideal world, I would generate a recipe for what work needs to be done (in Python), and writing that recipe might take a day or so because it does checks/etc. Once that recipe is made and written out to a file, it should be runnable on any machine - in parallel on 100 cores, or just 1. But I want to be able to review the work-to-be-run before I fire it off, and I want to it to be as simple as hell because this is 2016 and we all have better things to do that think about how to execute our commands :P

Obviously parallel can do things I didn't know it could do though, so I will look into this more :)

ADD REPLY
0
Entering edit mode

The problem I have with parallel is that it is often used to generate commands, rather than running a set of predefined commands.

Sorry, but whoever uses parallel like this, completely misunderstood the idea of it! Parallel executes arbitrary complex shell commands (even though they have to be in one line) in parallel. There is no need to "generate commands" in parallel. In particular, doing:

cat my-list-of-files.txt | parallel generate_my_command {} | sh

is even more ridiculous than

cat $my_file | grep $my_regex

For start learning to use parallel, have a look at all the examples in the parallel manpage.

However, I partially agree with you, that command/script generation is more convenient using a higher level language but that does not limit the use of parallel! In particular, you can easily do:

script_to_generate_commands.{pl,py,bash,whatsoever} input_file_list.txt | parallel -j $n_parallel_processes

You just have to make sure that script_to_generate_commands.* outputs them in one line (which is fine, because if you pipe directly into parallel, there is no need for newlines - machine-readable wins! :-P )

Another thing: usability and graphical outputs limit speed. In whichever parallel environment you work (be it parallel or a grid/cluster engine), you will likely not see the status while the execution is performed. Even so it is 2016 ;-)

PS: I like make, but the trick with exporting the bash function to parallel is a very handy trick! I like it! :)

ADD REPLY
0
Entering edit mode

It does work nicely, very nicely in fact. I don't use it for single/simple commands unless they won't fit on a single line. Typically it is just for more complicated things. The example I gave above would be better run as:

parallel samtools flatstat {} > {.}.flagstat :::: list_of_files

I use parallel for generating commands, but only for running swarm jobs. I'm not sure I understand the advantages of your approach, or the desire for it.

There's zero difference in what will be run between

parallel -a list_of_files samtools flagstat {} > {.}

and calling parallel sh -c... on a file that contains

samtools flagstat foo.bam > foo.flagstat
samtools flagstat foo2.bam > foo2.flagstat
..
..
..
samtools flagstat foon.bam > foon.flagstat

If anything, the first option is a much clearer presentation if what you were doing and will be much easier for someone (including yourself) to figure out what you were doing later on.

As for debugging, you should be running a test case through the whole process before production to make sure it works or the results are reasonable sensible. I'm not sure how a list of commands helps you with that. I'm not sure how "sudden aborts", which are user error 99.999% of the time, is harder to address with what I've suggested. You just run the command/script again after making any changes to it. Logging is also trivial since you can redirect stdout and/or stderr to be appended to a file.

Again, like I said, the call to parallel implicitly defines the commands to be run. That's the whole point of it.

Parallel also has an option --dry-run, which allows you to view the commands it will run given the arguments you supplied.

As for your python example, the same exact thing applies:

parallel python script.py {} :::: input.txt

It doesn't get easier than GNU Parallel. From what you said about "generating commands" with python, it sounds like you're trying to manage your workflows from within python and then executing them outside of python. This is okay, but in my personal experience is is a pain in the ass because managing processes/shell calls with python is a little clunky and over all it adds more work than it removes. Also your set up is not the most flexible, each python script you write generates a very specific list of commands, if you want to use mean insert (or whatever) for another program, you have to write another script.

I like to chop my workflows up into as reasonably small individual components, and then link them. Think of a directional graph. The nodes are data, this data can be in a file or generated by a program. The edges define which nodes/files require data from where.

Say you need to calculate the mean insert and then use that as a value for some other script. Rather than write a single python script to do this, I would have a python script that just calculates the average insert size given some file and prints it to standard out. Then I would have my second take this as an argument (or even read it from stdin). Something like:

mean_and_stuff(){
    in_bam=$1
    threshold=$2
    out_file=$3

    mean_ins=`calc_mean_insert.py $in_bam`

    if [ $mean_ins -gt $threshold]; then
        echo "Mean insert meets threshold"
        proc_bam.py $mean_ins $in_bam > $out_file
    else
        echo "ERROR: Mean insert does not meet threshold" 1>&2
        exit 1
    fi
}
export -f mean_and_stuff
parallel mean_and_stuff {} 300 {.}.results :::: list_of_files

The nice thing about this approach is I can easily recycle code, I can use calc_mean.py for anything else that would need the mean. You don't have to use BASH for this, I like it because it makes things very easy and clean, you still use this approach with python but it gets cumbersome.

ADD REPLY
0
Entering edit mode

Ah, well, everyone has their way of doing things. Like I said, I personally don't like generating-and-running commands in one step because I think it's bad practice. I much prefer to have some on-disk record of exactly what commands where run and with what parameters. Taking your example of running flagstat, I just tried running it now on my data but I got an "Assertion 'fp' failed." error.

After googling the error I came across this other user who has having the same problem (https://plus.google.com/101693197813186529994/posts/A6bnPCvvVcw)

Karanbir apparently fixed the problem by putting a "-" before the bam file path. However, after looking into my list_of_files, it turns out that it is actually just a banana, so maybe this fix wont work for me...

Oh well. I will just wait until the banana goes away and my list of files comes back, then hopefully it will work.

In the meantime I will try putting more semicolons into the command line to see if that makes a difference :)

ADD REPLY
2
Entering edit mode

You need to quote >.

ADD REPLY
1
Entering edit mode

Lost in the deep?

ADD REPLY
0
Entering edit mode

I guess this was a waste of time.

ADD REPLY
1
Entering edit mode

Oh don't be such a poor sport - a serious reply wouldn't have illustrated my point any better :-)

I made an argument that separation of command-generation and command-execution has its merits, and as wonderful as parallel is for some tasks where it's result is known an infallible, the more logic you put into command-generation the more headaches you're going to have down-stream when you think you're debugging execution but really the error lies in generation. I'm sorry if my point wasn't clear enough, but honestly I'm not trying to waste your time! You taught me a lot - which is why I spent time on a funny reply rather than a lazy quote-you-and-disagree.

ADD REPLY

Login before adding your answer.

Traffic: 1864 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6