Tool: GNU Parallel - Parallelize Serial Command Line Programs Without Changing Them
11.8 years ago
ole.tange ★ 4.5k

Article describing tool (for citations):

O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.

Author's website for obtaining code:

http://www.gnu.org/software/parallel/

All new computers have multiple cores. Many bioinformatics tools are serial in nature and will therefore not use the multiple cores. However, many bioinformatics tasks (especially within NGS) are extremely parallelizable:

  • Run the same program on many files
  • Run the same program on every sequence

GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.

If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU:

[Figure: Simple scheduling]

GNU Parallel instead spawns a new process when one finishes - keeping the CPUs active and thus saving time:

[Figure: GNU Parallel scheduling]
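
To put numbers on the difference, here is a toy model in plain shell. It is not GNU Parallel's actual scheduler, only a sketch under simple assumptions: job run times are whole seconds, the function names are made up, and the "static" variant hands each CPU a fixed block of consecutive jobs.

```shell
# Toy model: wall time needed for a list of job run times (whole seconds).
# Not GNU Parallel's scheduler; function names are made up for this sketch.

# Static split: each CPU gets a fixed, equally sized block of consecutive jobs.
static_makespan() {
  slots=$1; shift
  per=$(( ($# + slots - 1) / slots ))   # jobs per CPU, rounded up
  max=0; sum=0; n=0
  for d in "$@"; do
    sum=$((sum + d)); n=$((n + 1))
    if [ "$n" -eq "$per" ]; then
      if [ "$sum" -gt "$max" ]; then max=$sum; fi
      sum=0; n=0
    fi
  done
  if [ "$sum" -gt "$max" ]; then max=$sum; fi   # last, possibly shorter block
  echo "$max"
}

# Dynamic: every job goes to the CPU that becomes free first.
dynamic_makespan() {
  slots=$1; shift
  loads=""; i=0
  while [ "$i" -lt "$slots" ]; do loads="$loads 0"; i=$((i + 1)); done
  for d in "$@"; do
    min=""
    for l in $loads; do                 # find the earliest finishing CPU
      if [ -z "$min" ] || [ "$l" -lt "$min" ]; then min=$l; fi
    done
    new=""; placed=0
    for l in $loads; do                 # give that CPU the next job
      if [ "$placed" -eq 0 ] && [ "$l" -eq "$min" ]; then
        l=$((l + d)); placed=1
      fi
      new="$new $l"
    done
    loads=$new
  done
  max=0
  for l in $loads; do
    if [ "$l" -gt "$max" ]; then max=$l; fi
  done
  echo "$max"
}

static_makespan 2 5 1 1 1    # prints 6: one CPU runs 5+1, the other 1+1
dynamic_makespan 2 5 1 1 1   # prints 5: the short jobs all land on the idle CPU
```

With jobs of equal length both strategies give the same result; the gain shows up when run times differ.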

Installation

A personal installation does not require root access. It can be done in 10 seconds by doing this:

(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash

For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README

EXAMPLE: Replace a for-loop

It is often faster to write a command using GNU Parallel than to write a for loop:

for i in *.gz; do
  zcat "$i" > "$(basename "$i" .gz).unpacked"
done

can be written as:

parallel 'zcat {} > {.}.unpacked' ::: *.gz

The added benefit is that the zcats are run in parallel - one per CPU core.
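
If the replacement strings are new to you, it may help to see rough plain-shell equivalents. This is only a sketch for typical file names (the helper names are made up, and GNU Parallel's own handling of unusual paths may differ in detail):

```shell
# Rough POSIX parameter-expansion equivalents of some replacement strings.
# Helper names are made up for this sketch.

strip_ext()  { printf '%s\n' "${1%.*}"; }               # like {.}  : drop the last extension
base_name()  { printf '%s\n' "${1##*/}"; }              # like {/}  : drop the directory part
base_noext() { b=${1##*/}; printf '%s\n' "${b%.*}"; }   # like {/.} : drop both

strip_ext  reads/sample1.fq.gz    # reads/sample1.fq
base_name  reads/sample1.fq.gz    # sample1.fq.gz
base_noext reads/sample1.fq.gz    # sample1.fq
```

Note that {.} keeps the directory part, while the basename call in the for loop removes it; for files in the current directory the two give the same name.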

EXAMPLE: Parallelizing BLAT

This will start a blat process for each processor and distribute foo.fa to these in 1 MB blocks:

cat foo.fa | parallel --round-robin --pipe --recstart '>' 'blat -noHead genome.fa stdin >(cat) >&2' >foo.psl

EXAMPLE: Processing interleaved Fastq

FASTQ files have the format:

@M10991:61:000000000-A7EML:1:1101:14011:1001 1:N:0:28
CTCCTAGGTCGGCATGATGGGGGAAGGAGAGCATGGGAAGAAATGAGAGAGTAGCAAGG
+
#8BCCGGGGGFEFECFGGGGGGGGG@;FFGGGEG@FF<EE<@FFC,CEGCCGGFF<FGF

Interleaved FASTQ starts with a line like these:

@HWUSI-EAS100R:6:73:941:1973#0/1
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
@EAS139:136:FC706VJ:2:2104:15343:197393 1:N:18:1

where '/1' and ' 1:' determine that this is read 1.

This will cut big.fq into one chunk per CPU core and pass it on stdin (standard input) to the program fastq-reader:

parallel --pipepart -a big.fq --block -1 --regexp \
       --recend '\n' --recstart '@.*(/1| 1:.*)\n[A-Za-z\n\.~]' \
       fastq-reader

EXAMPLE: Blast on multiple machines

Assume you have a 1 GB FASTA file that you want to BLAST. GNU Parallel can then split the FASTA file into 100 KB chunks and run 1 job per CPU core:

cat 1gb.fasta | parallel --block 100k --recstart '>' --pipe blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > results

If you have access to the local machine, server1 and server2, GNU Parallel can distribute the jobs to each of the servers. It will automatically detect how many CPU cores are on each of the servers:

cat 1gb.fasta | parallel -S :,server1,server2 --block 100k --recstart '>' --pipe blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > result

EXAMPLE: Run bigWigToWig for each chromosome

If you have one file per chromosome it is easy to parallelize processing each file. Here we do bigWigToWig for chromosomes 1..19 + X Y M. These will run in parallel but only one job per CPU core. The {} will be substituted with arguments following the separator ':::'.

parallel bigWigToWig -chrom=chr{} wgEncodeCrgMapabilityAlign36mer_mm9.bigWig mm9_36mer_chr{}.map ::: {1..19} X Y M

EXAMPLE: Running composed commands

GNU Parallel is not limited to running a single command. It can run a composed command. Here is how you process multiple FASTA files using Biopieces (which uses pipes to communicate):

parallel 'read_fasta -i {} | extract_seq -l 5 | write_fasta -o {.}_trim.fna -x' ::: *.fna

See also: https://github.com/maasha/biopieces/wiki/HowTo#howto-use-biopieces-with-gnu-parallel

EXAMPLE: Running experiments

Experiments often have several parameters where every combination should be tested. Assume we have a program called experiment that takes 3 arguments: --age --sex --chr:

experiment --age 18 --sex M --chr 22

Now we want to run experiment for every combination of ages 1..80, sex M/F, chr 1..22+XY:

parallel experiment --age {1} --sex {2} --chr {3} ::: {1..80} ::: M F ::: {1..22} X Y
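
Several ':::' sources are combined into every possible combination. As a sanity check on how many jobs that is, the same enumeration can be sketched with nested loops in plain shell. The function name is made up, and nothing is executed; the sketch only prints the command lines:

```shell
# Plain-shell sketch of the three ':::' input sources above: nested loops
# over every combination. Only prints the command lines.
list_experiments() {
  for age in $(seq 1 80); do
    for sex in M F; do
      for chr in $(seq 1 22) X Y; do
        echo "experiment --age $age --sex $sex --chr $chr"
      done
    done
  done
}

list_experiments | sed -n '1,3p'   # first three command lines
list_experiments | wc -l           # 80 * 2 * 24 = 3840 jobs
```

Adding --dry-run to the parallel command prints its command lines in the same spirit without running them.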

To save the output in different files you could do:

parallel experiment --age {1} --sex {2} --chr {3} '>' output.{1}.{2}.{3} ::: {1..80} ::: M F ::: {1..22} X Y

But GNU Parallel can structure the output into directories so you avoid having thousands of output files in a single dir:

parallel --results outputdir experiment --age {1} --sex {2} --chr {3} ::: {1..80} ::: M F ::: {1..22} X Y

This will create files like outputdir/1/80/2/M/3/X/stdout containing the standard output of the job.

If you have many different parameters it may be handy to name them:

parallel --result outputdir --header : experiment --age {AGE} --sex {SEX} --chr {CHR} ::: AGE {1..80} ::: SEX M F ::: CHR {1..22} X Y

Then the output files will be named like outputdir/AGE/80/CHR/Y/SEX/F/stdout

If you want the output in a CSV/TSV-file that you can read into R or LibreOffice Calc, simply point --result to a file ending in .csv/.tsv:

parallel --result output.tsv --header : experiment --age {AGE} --sex {SEX} --chr {CHR} ::: AGE {1..80} ::: SEX M F ::: CHR {1..22} X Y

It will deal correctly with newlines in the output, so they will be read as newlines in R or LibreOffice Calc.

If one of your parameters takes on many different values, these can be read from a file using '::::':

echo AGE > age_file
seq 1 80 >> age_file
parallel --results outputdir --header : experiment --age {AGE} --sex {SEX} --chr {CHR} :::: age_file ::: SEX M F ::: CHR {1..22} X Y

If you have many experiments, it can be useful to see some experiments picked at random. Think of it as painting a picture by numbers: You can start from the top corner, or you can paint bits at random. If you paint bits at random, you will often see a pattern earlier than if you painted in the structured way.

With --shuf GNU Parallel will shuffle the experiments and run them all, but in random order:

parallel --shuf --results outputdir --header : experiment --age {AGE} --sex {SEX} --chr {CHR} :::: age_file ::: SEX M F ::: CHR {1..22} X Y

EXAMPLE (advanced): Using GNU Parallel to parallelize your own scripts

Assume you have a Bash/Perl/Python script called launch. It takes one argument, ID:

launch ID

Using parallel you can run multiple IDs in parallel using:

parallel launch ::: ID1 ID2 ...

But you would like to hide this complexity from the user, so the user only has to do:

launch ID1 ID2 ...

You can do that using --shebang-wrap. Change the shebang line from:

#!/usr/bin/env bash
#!/usr/bin/env perl
#!/usr/bin/env python

to:

#!/usr/bin/parallel --shebang-wrap bash
#!/usr/bin/parallel --shebang-wrap perl
#!/usr/bin/parallel --shebang-wrap python

You further develop your script so it now takes an ID and a DIR:

launch ID DIR

You would like it to take multiple IDs but only one DIR, and run the IDs in parallel. Again just change the shebang line to:

#!/usr/bin/parallel --shebang-wrap bash

And now you can run:

launch ID1 ID2 ID3 ::: DIR

Learn more

See more examples: http://www.gnu.org/software/parallel/man.html

Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Walk through the tutorial once a year - your command line will love you for it: http://www.gnu.org/software/parallel/parallel_tutorial.html

Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

#ilovefs

If you like GNU Parallel:

  • Give a demo at your local user group/team/colleagues (remember to show them --bibtex)
  • Post the intro videos on Reddit/Diaspora*/forums/blogs/Identi.ca/Google+/Twitter/Facebook/Linkedin/mailing lists
  • Get the merchandise https://www.gnu.org/s/parallel/merchandise.html
  • Request or write a review for your favourite blog or magazine
  • Request or build a package for your favourite distribution (if it is not already there)
  • Invite me for your next conference

When using programs that use GNU Parallel to process data for publication you should cite as per parallel --citation. If you prefer not to cite, contact me.

If GNU Parallel saves you money:

parallel next-gen ngs • 147k views

Excellent examples. I've been using GNU Parallel for a while, but I learned a lot by reading this. Thanks for posting, and for the videos (those really helped me get off the ground with Parallel).


This is very very useful. Thanks for the concrete examples. BTW, about zcat, a multithreaded version of gzip exists, it is called "pigz" ;-)


pigz (http://zlib.net/pigz/) is simple and can save a very significant amount of time if you have a lot of threads available.


First, GNU Parallel has the best installability of everything I have seen in my life. And all examples worked - gunzips, blasts etc.

However, I got stuck on one thing.

I have a simple Perl script called transparal.pl. It does something to a file that is provided as an argument. The original one behaves as expected after chmod 700:

$ ./transparal_old.pl 
Could not open file! at ./transparal_old.pl line 5. ( did not give the input file name!)

Then I changed the shebang to

#!/usr/bin/parallel --shebang-wrap perl

and ...

$ ./transparal.pl 
-bash: ./transparal.pl: /usr/bin/parallel: bad interpreter: No such file or directory

checking GNU parallel

$ which parallel
/usr/local/bin/parallel

Looks ok. Confused.


Look at #!/usr/bin/parallel vs. where it really is: #!/usr/local/bin/parallel. You could also try #!/usr/bin/env parallel instead; then it will just take parallel from your PATH, which is possibly the (mostly) portable way. However, I am not sure if the /usr/bin/env way handles parameters to the program. Edit: see http://stackoverflow.com/questions/4303128/how-to-use-multiple-arguments-with-a-shebang-i-e which in conclusion means that on many systems you can only use the correct absolute path (and #!/usr/bin/env parallel followed by arguments to the program most likely will not work).


I wonder if recent builds of the binary expect it to be in /usr/bin. Still trying to troubleshoot a similar problem.


I have built my own parallel and installed it in $HOME/bin which is in my PATH, worked fine for me.


Great. On a cluster, does one need to acquire/assign the cores to be used using MPI/SMP first, or can one just run parallel without it?


That would depend on the rules for your cluster. Default for GNU Parallel is to spawn one process per cpu core.


It's Rocks running JAVA SGE, I will test and see. Cheers


@ole.tange maybe you could briefly explain why parallel is superior to a for loop - aside from the shorter syntax.


The for loop executes the commands one at a time. Parallel can use multiple processors to run them in parallel.


A for loop can start many jobs simultaneously by putting them in the background -> for i in *; do cat $i & done; - but that way you may start 1000 jobs which is probably inefficient. Parallel does some clever load balancing.


What about dealing with the many bioinformatics tools that do not accept streams as input and insist on reading files instead? (e.g. blat I think). Is there an easy way to autogenerate and delete such files within a single line of "parallel"?


You can use named pipes to stream data to placeholder files, which can be used with some tools that do not read streams: http://en.wikipedia.org/wiki/Named_pipe


Wow - very cool!


amazing, upvote


In these cases, what I do is to write a wrapper script, which generates any parameter file needed for running the script.


One solution is to create a file: cat file | parallel --pipe "cat >{#}; my_program {#}; rm {#}". Alex suggests using named pipes - which is more efficient, but does not work with every tool: cat file | parallel --pipe "mkfifo {#}; my_program {#} & cat >{#};rm {#}"
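
For anyone who has not used named pipes before, here is a minimal sketch of the idea in plain shell, without GNU Parallel. grep -c stands in for a program that insists on a file name; the temporary directory and the FIFO name are made up for the example:

```shell
# Named-pipe sketch: a program that wants a file name reads from a FIFO
# while another process writes into it.
tmpdir=$(mktemp -d)
fifo="$tmpdir/chunk.fifo"
mkfifo "$fifo"

printf 'line1\nline2\nline3\n' > "$fifo" &   # writer blocks until a reader opens the FIFO
count=$(grep -c '' "$fifo")                  # reader is given a file name, not stdin
wait                                         # make sure the writer has finished
rm -r "$tmpdir"

echo "$count"   # number of lines that went through the pipe
```

No data is written to disk this way, but a FIFO can only be read once from start to end, so tools that seek in the file or read it twice will not work with it.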


Hey Ole, how do I get around the awk quotes? For example, counting the reads in FASTQ files like this won't work:

parallel 'echo && gunzip -c | wc -l | awk \'{print $1/4}\'' ::: *fastq.gz


Hi Sukhdeep, the following worked for me: parallel "echo {} && gunzip -c {} | wc -l | awk '{d=\$1; print d/4;}'" ::: *.gz
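
Another way around the quoting, sketched here under the assumption that you run Bash, is to move the pipeline into a shell function and export it, as other posts in this thread do. Then awk's quotes and $ never pass through GNU Parallel's command string. The names count_reads and count_all are made up for this example:

```shell
# Sketch: keep the awk program inside a function to avoid nested quoting.
count_reads() {
  # 4 lines per FASTQ record; NR in END is the total number of lines read
  gunzip -c "$1" | awk 'END { print NR / 4 }'
}

count_all() {
  export -f count_reads                 # Bash-only: make the function visible to child shells
  parallel --tag count_reads ::: "$@"   # --tag prefixes each result with its file name
}
# Usage, assuming GNU Parallel is installed:
#   count_all *.fastq.gz
```

count_reads can be tested on a single file first, which makes quoting mistakes easier to spot than inside a long one-liner.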


Super Alex, it works :)


In the past I've successfully convinced some tools (e.g. GATK HaplotypeCaller) to accept /dev/stdin & /dev/stdout as input and output files, respectively. Give it a try ;)


Use --fifo:

cat file | parallel --fifo --pipe wc {}

Or --cat:

cat file | parallel --cat --pipe wc {}

How do we put a wait command between different parallel runs? I have a script that performs multiple jobs in order.

parallel jobs 1
wait
parallel jobs 2
...etc

Will that work?


Why do you need to call wait?


I was thinking that everything starts in parallel. I have to wait until jobs 1 finishes and then start jobs 2.


use GNU make with option -j


Maybe explore GNU Parallel's semaphore options.

(Though a make-based process as Pierre suggests, or dedicated job scheduler is probably going to be easier to maintain.)


It will work, but there is really no need to call wait. GNU Parallel does that automatically. Try:

parallel 'sleep {};echo Jobslot {%} slept {} seconds' ::: 4 3 2 1
seq 5 -.1 0 | parallel 'sleep {};echo Jobslot {%} slept {} seconds'
seq 5 -.1 0 | parallel -j0 'sleep {};echo Jobslot {%} slept {} seconds'

You will see that GNU Parallel only finishes after the last job is done.


What is the best way to deal with the error below?

parallel "do something" ::: seq.*
-bash: /usr/local/bin/parallel: Argument list too long

Alternatively, if somebody could show me how to pipe to a command defined in a bash script, that would be just wonderful. Right now, I'm doing:

#split the multifasta into individual seqs
cat $NAME/file.fna | parallel --recstart '>' -N1 --pipe "cat - > $NAME/seq.{#}"
#do stuff with the split files
export -f blastFunction
parallel blastFunction ::: $NAME/seq.*

and blastFunction begins like this:

blastFunction() {
        BLAST=$(blastn -query $1 -subject $1 -outfmt 6 -perc_identity 100)

Try giving a text file which lists the input files to parallel instead of direct arguments. You can do this via 4 colons (::::)

Get a list of files by:

ls *.txt > myFiles

Then do:

parallel "do something" :::: myFiles

Thanks for the suggestion, but

-bash: /bin/ls: Argument list too long

yes, if the argument list is too long for bash, it won't work with any command, where you let the shell glob the file names.


Example "EXAMPLE: convert all PED files in a directory to BED" should work for this using find, you seem to have too many txt files


Thanks, this one worked

find $NAME/ -type f -maxdepth 1 -iname "seq.*" | parallel blastFunction

Of course optimal solution would be if I could skip the part of creating thousands and thousands of files..


you might try the >> operator next time you create files


Use Bash's builtin printf. Because it is builtin it is not subject to the same limit:

printf "%s\0" seq.* | parallel -0 do something
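
A small sketch of the same idea wrapped in a function, so it can be tried on a test directory first. The helper name list_seq_files is made up; the directory is passed as the first argument:

```shell
# Sketch: list matching files NUL-separated using the shell's built-in printf.
# If nothing matches, the unexpanded pattern itself is printed.
list_seq_files() {
  printf '%s\0' "$1"/seq.*
}
# Typical use, assuming GNU Parallel and the exported blastFunction from the question:
#   list_seq_files "$NAME" | parallel -0 blastFunction
```

The NUL separator also keeps file names with spaces intact, which a plain newline-separated ls listing does not guarantee.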

Hi all, I need some help running freebayes on my two samples using parallel. I saw this in the previous post but got confused:

parallel --keep-order --max-procs 11 "freebayes --fasta-reference $REF \
    --genotype-qualities --experimental-gls \
    --region {} $BAM  " ::: $seqnames \
    | vcffirstheader \
    | vt normalize -r $REF - > $VCF

I have two BAM files that need to be run with freebayes. In the first run I don't want to use vcffirstheader and vt normalize, and in the second run I do.

Let's say the files are S1.bam and S2.bam and the reference is hg38.fa. Also, do I need

--region {} $BAM

?

parallel --keep-order --max-procs 0 "freebayes --fasta-reference hg38.fa " ::: S1.bam S2.bam > output_1.vcf

With vcffirstheader and vt normalize, would it look like:

parallel --keep-order --max-procs 11 "freebayes --fasta-reference hg38.fa " ::: S1.bam S2.bam | vcffirstheader | vt normalize -r hg38.fa - > output_2.vcf

Can someone please help me?

Thanks


Hi I have a burning question,

I want to run a script named "predict_binding.py". Its syntax is:

./predict_binding.py [argA] [argB] [argC] ./file.txt

file.txt has a column of strings with the same length:

string_1 
string_2 
string_3
...
string_n

predict_binding.py works with the first 3 arguments and string_1, then the 3 arguments and string_2, and so on.

That's fine, but now I have m values of argB, and I want to test all of them. I want to use the cluster for this, and this looks like a perfect job for parallel, doesn't it?

After reading the manual and spending hours trying to make it work, I realised I need some help.

What works so far (and is trivial) is:

parallel --verbose ./predict_binding ::: argA ::: argBi ::: argC ::: ./file.txt

This gives the same result as:

./predict_binding.py argA argBi argC ./file.txt

And indeed the flag --verbose says that the command looks like

./predict_binding.py argA argBi argC ./file.txt

but I want to test all argB values, so I made a file called args.txt, which looks like this:

argA argB1 argC ./file.txt
argA argB2 argC ./file.txt
...
argA argBm argC ./file.txt

If I do:

cat args.txt | parallel --verbose ./predict_binding.py {}

I get an error from ./predict_binding saying:

predict_binding.py: error: incorrect number of arguments

And verbose says that the command looks like: ./predict_binding.py argA\ argBi\ argC\ ./file.txt

So, maybe those backslashes are affecting the input of ./predict_binding? How could I avoid them?

I have tried using double and single quotes " ', backslash \, and backslash with single quote \'; none has worked!

I also tried:

cat ./args.txt | parallel --verbose echo | ./predict_binding

Same error as above.

And also I tried to use a function like:

binding_func() { ./predict_binding argA "$1" argC ./file.txt; }

Interestingly, binding_func works for:

parallel binding_func ::: argB1

But if I do:

parallel binding_func ::: argB1 argB2

It gives the result for one arg but fails (same error as above) for the other.

If I put only argB1 in the args.txt file and do:

cat args.txt | parallel --verbose binding_func {}

It fails miserably with the same error: predict_binding.py: error: incorrect number of arguments

It seems a very trivial and easy problem but I haven't been able to solve it }:(

I would appreciate very much any help provided. :)


parallel ./predict_binding.py argA argB argC :::: ./file.txt
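
If you would rather keep args.txt with one full argument list per line, the backslashes in the --verbose output are there because each whole line is passed as a single argument. One possible approach, assuming the columns are separated by single spaces, is to let GNU Parallel split each line with --colsep. The function names below are made up; the first one is a plain-shell preview that can be run without GNU Parallel:

```shell
# Sketch: treat each line of args.txt as four separate arguments.

# Plain-shell preview of the commands, one per input line.
preview_commands() {
  while read -r a b c f; do
    echo "./predict_binding.py $a $b $c $f"
  done < "$1"
}

# The same split done by GNU Parallel: --colsep cuts each line into columns,
# which are then available as {1}, {2}, {3} and {4}.
run_commands() {
  parallel --colsep ' ' ./predict_binding.py {1} {2} {3} {4} :::: "$1"
}
# Usage: preview_commands args.txt   (check the commands)
#        run_commands args.txt       (run them, assuming GNU Parallel is installed)
```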


This is just wonderful. As mentioned above, GNU Parallel can parallelize your own scripts (Bash/Python/Perl etc.) so that they take multiple IDs (i.e., arguments) in a single go. Does it also work the other way around: taking a single argument and running it on multiple cores of the computer?


How would you run a single argument on multiple cores?


What options are available if you want to utilize other machines but they require a password for ssh? Is there a way to force using rsh?


Set up RSA keys for password-less SSH.


Thank you so much!!! This turned a 5.5 hour blast+ job into 25 minutes!


Hello ole.tange

In the case of BLAST, I was wondering what the difference is between using -num_threads and using parallel. When I use parallel and run top, it shows that all processes are blast, each at 99-100% CPU, while when I use -num_threads it shows only one blast process, but at 5900% CPU. (I have 60 cores in the server.)

I am having trouble understanding the difference between the two approaches.


You should use which ever works faster for you.


I am creating a single .fastq.gz file from many .fastq.gz files with the following command

zcat 15_S15*.fastq.gz | gzip -c > combined_file.fastq.gz

Now I want to do it with the parallel command.

Can anyone help me?


furthermore: you don't need zcat |gzip ; see How To Merge Two Fastq.Gz Files?


Please ask this as a new question.

11.1 years ago

I put my notebook about GNU parallel on figshare:

http://figshare.com/articles/GNU_parallel_for_Bioinformatics_my_notebook/822138

My document follows Ole Tange’s GNU parallel tutorial ( http://www.gnu.org/software/parallel/parallel_tutorial.html ) but I tried to use some bioinformatics-related examples (align with BWA, Samtools, etc.. ).


Thanks so much for sharing this, it's really useful. I notice that you don't include an example for sorting bam files using parallel. I'm trying that now:

sort_index_bam(){
    outfile=`echo $1 | sed -e 's/.bam/_sorted/g'`
    samtools sort $1 $outfile
    index_file="$outfile.bam"
    samtools index $outfile
}
export -f sort_index_bam
parallel -j10 --progress --xapply sort_index_bam ::: `ls -1 *.bam`

And get the error (for example)

[E::hts_open] fail to open file 'HiC_LY1_1_NoIndex_L003_034_sorted.bam'
local:10/33/100%/42.8s [bam_sort_core] merging from 3 files...

Perhaps it's something to do with how parallel schedules its worker threads? The same scripted commands work fine on the command line in serial. I'm wondering if you have tried something similar.


Try this (requires parallel 20140822 or later):

sort_index_bam(){
    samtools sort "$1" "$2"
    samtools index "$2"
}
export -f sort_index_bam
parallel --bar sort_index_bam {} '{=s/.bam/_sorted.bam/=}' ::: *.bam

This is awesome!


Thank you Pierre! Do you think it would be possible to have a version with bigger fonts? At 125%, it is difficult to read.

11.8 years ago
lh3 33k

All the clusters I use require SGE/LSF. My understanding is that parallel does not support SGE/LSF (correct me if I am wrong). I would recommend a general way to compose multiple command lines as:

seq 22 | xargs -i echo samtools cmd -r chr{} aln.bam | parallel -j 5
ls *.bed | sed s,.bed,, | xargs -i echo mv {}.bed {}.gff | sh
ls *.psmcfa | sed s,.psmcfa,, | xargs -i echo psmc {}.psmcfa \> {}.psmc \& | sh
ls *.psmcfa | sed s,.psmcfa,, | xargs -i echo bsub 'psmc -o {}.psmc {}.psmcfa' | sh

For the last command line to submit jobs to LSF, I more often use my asub script:

ls *.psmcfa | sed s,.psmcfa,, | xargs -i echo psmc -o {}.psmc {}.psmcfa | asub

You can see from the above examples that the general pattern is to feed a list to xargs -i echo to let it print the commands to stdout. The last command after pipe | can be sh if you know they run fast, parallel if you want to control the number of concurrent jobs on the same machine, or asub if you want to submit to LSF/SGE. There are also a few variants: e.g. put & or bsub in the echo command. With xargs -i echo, you will not be bound to the parallel grammar. Another advantage is for complex command lines, you can pipe it to more to visually check if the commands are correct. At least for me, I frequently see problems from more before submitting thousands of jobs.


Here are your examples using GNU Parallel:

seq 22 | parallel -j5 samtools cmd -r chr{} aln.bam
ls *.bed | parallel mv {} {.}.gff
ls *.psmcfa | parallel psmc {} \> {.}.psmc
ls *.psmcfa | parallel bsub psmc -o {.}.psmc {}
ls *.psmcfa | parallel echo psmc -o {.}.psmc {} | asub

It is shorter and IMHO easier to read.

You can use --dry-run if you want to see what would be done.

ls *.psmcfa | parallel --dry-run psmc {} \> {.}.psmc

I have known the basics of parallel for some time. My problem is that it is not a standard tool. None of the machines I use in multiple institutes have it. Sed/awk/perl/xargs exist in every Unix distribution. My point is to learn to construct command lines in general. There may be more complicated cases you cannot do with one parallel.


As long as you are allowed to run your own scripts you can run GNU Parallel. 10 seconds installation: 'wget -O - pi.dk/3 | bash'. Read Minimal installation in http://git.savannah.gnu.org/cgit/parallel.git/tree/README

The examples you provide deal badly with spaces in the filenames. Using GNU Parallel this is no longer an issue.


These are both good points. There are good arguments to make for both cases. On the one hand, you don't want to clutter your pipelines with tools and binaries that are not minimally standard. On the other hand, good tools will eventually become a standard (this tool may be an example). Somewhat related, I think this project (shameless plug): https://github.com/drio/bio.brew can help mitigate the management of software tools. It is especially useful when you don't have root access to the boxes where you do your analysis.

11.8 years ago

EXAMPLE: Using multiple SSH-capable hosts to efficiently generate a highly-compressed BED archive

For labs without an SGE installation but lots of quiet hosts running an SSH service and BEDOPS tools, we can use GNU Parallel to quickly generate per-chromosome, highly-compressed archives of an input BED file, stitching them together at the end into one complete Starch archive.

This archival process can reduce a four-column BED file to about 5% of its original size, while preserving the ability to do memory-efficient, high-performance and powerful multi-set and statistical operations with bedops, bedmap, bedextract and closest-features:

$ PARALLEL_HOSTS=foo,bar,baz
$ bedextract --list-chr input.bed \
    | parallel \
        --sshlogin $PARALLEL_HOSTS \
        "bedextract {} input.bed | starch - > input.{}.starch"
$ starchcat input.*.starch > input.starch
$ rm input.*.starch

Once the archive is made, it can be operated on directly, just like a BED file, e.g.:

$ bedops --chrom chrN --element-of -1 input.starch another.bed
...

We have posted a GNU Parallel-based variant of our starchcluster script, part of the BEDOPS toolkit, which uses some of the above code to facilitate using multiple hosts to efficiently parallelize the process of making highly-compressed and operable Starch archives out of BED inputs.


I am trying to figure out why you need --max-lines 1 --keep-order. Would it not work just as well without? Also why not use GNU Parallel's automation to figure out the number of cores instead of forcing $MAX_JOBS in parallel? (and you miss a | before parallel)


You're right - assuming that defaults do not change, then it isn't necessary to be explicit. Thanks for the pipe catch.


Is it possible to make parallel use rsh in the case where ssh requires a password?


Not sure, but you can certainly use RSA keys to SSH across hosts without a password. See: http://archive.news.softpedia.com/news/How-to-Use-RSA-Key-for-SSH-Authentication-38599.shtml


Yes. Normally a --sshlogin is simply a host name:

server1

If you want to use another "ssh-command" than ssh, then you prepend with full path to the command:

/usr/bin/rsh server1

Making it look like this:

parallel -S "/usr/bin/rsh server1" do_stuff

But if possible it is a much better idea to use SSH with SSH-agent to avoid typing the passwords: https://wiki.dna.ku.dk/dokuwiki/doku.php?id=ssh_config

10.3 years ago
brentp 24k

Practical variant calling example

get sequence names from FASTA to parallelize by chromosome--not perfect, but works well in practice:

seqnames=$(grep ">" $REF | awk '{ print substr($1, 2, length($1)) }')
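
The name extraction can be checked on a small FASTA file before running it on a whole genome. A slightly stricter sketch of the same step (the function name fasta_names is made up) only looks at lines that start with '>':

```shell
# Sketch: print the sequence names of a FASTA file, one per line.
# Anchoring on '^>' avoids matching a '>' that appears elsewhere in a line.
fasta_names() {
  awk '/^>/ { print substr($1, 2) }' "$1"
}
# Usage: seqnames=$(fasta_names "$REF")
```

As in the original one-liner, only the first word of each header line is kept, which is what tools like samtools expect as a region name.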

run samtools

parallel --keep-order --max-procs 11 "samtools mpileup -Euf $REF -r {} $BAM \
   | bcftools view -v -" ::: $seqnames \
   | vcffirstheader \
   | vt normalize -r $REF - > $VCF

where vcffirstheader is from vcflib and vt normalize is from https://github.com/atks/vt

Same for freebayes:

parallel --keep-order --max-procs 11 "freebayes --fasta-reference $REF \
    --genotype-qualities --experimental-gls \
    --region {} $BAM  " ::: $seqnames \
    | vcffirstheader \
    | vt normalize -r $REF - > $VCF

When your parallel command spans multiple screen lines it is time to consider using a bash function instead:

my_freebayes() {
  freebayes --fasta-reference $REF --genotype-qualities --experimental-gls --region "$1" $BAM
}
export -f my_freebayes

parallel --keep-order --max-procs 11 my_freebayes ::: $seqnames \
    | vcffirstheader \
    | vt normalize -r $REF - > $VCF

But it is purely a matter of taste.

11.8 years ago

EXAMPLE: Coalescent simulations using COSI

This script can be used to launch multiple cosi simulations using the GNU/parallel tool, or a Sun Grid Engine environment.

This is how I launch it:

$: seq 1 100 | parallel ./launch_single_cosi_iteration.sh {} outputfolder

This is a home-made script that I wrote to execute a one-time task. I've tried to adapt it for the general case and to improve the documentation, but I didn't spend much time on it. Please ask me if it doesn't work or if you have any doubts.


EXAMPLE: convert all PED files in a directory to BED

This is an example of how GNU/parallel can be used in combination with plink (or vcftools) to execute tasks on a set of data files. Note how {.} takes the value of the file name without the extension.

find . -type f -maxdepth 1 -iname "*ped" | parallel "plink --make-bed --noweb --file {.} --out {.}"

Since you do not use UNIX special chars (such as | * > &) in your command the " are not needed.


thank you, I didn't know that :-)

5.7 years ago
ole.tange ★ 4.5k

GNU Parallel now has a cheat sheet:

https://www.gnu.org/software/parallel/parallel_cheat.pdf

4.0 years ago
ole.tange ★ 4.5k

EXAMPLE: grouping of lines

GNU Parallel > 20190522 can split piped input into chunks based on the value of a given field.

You have input as:

sampleID,chr1, ...
sampleID,chr1, ...
:
sampleID,chr1, ...
sampleID,chr2, ...
:
sampleID,chr2, ...
sampleID,chr3, ...

You have a program that reads lines for one chromosome (process_chr), so you want the input to be chopped into chunks based on the value in column 2:

cat file | parallel --group-by 2 --colsep , -N1 --pipe process_chr

If process_chr reads 1 or more chromosomes:

cat file | parallel --group-by 2 --colsep , --pipe process_chr
5.0 years ago
ole.tange ★ 4.5k

EXAMPLE: Call program with FASTA sequence

FASTA files have the format:

>Sequence name1
sequence
sequence continued
>Sequence name2
sequence
sequence continued
more sequence

To call myprog with the sequence as argument run:

cat file.fasta |
  parallel --pipe -N1 --recstart '>' --rrs \
    'read a; echo Name: "$a"; myprog $(tr -d "\n")'
19 months ago
ole.tange ★ 4.5k

EXAMPLE: Call program on paired-end files

You have files from paired end sequencing:

foo_1.fastq.gz, foo_2.fastq.gz, bar_1.fastq.gz, bar_2.fastq.gz

in /some/dir

You want to run:

myprg /some/dir/foo_1.fastq.gz /some/dir/foo_2.fastq.gz foo_1.out
myprg /some/dir/bar_1.fastq.gz /some/dir/bar_2.fastq.gz bar_1.out

Run:

parallel --plus myprg {} {/_1.fastq/_2.fastq} {/..}.out ::: /some/dir/*_1.fastq.gz
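
To check what the three replacement strings should expand to, here is a plain-shell sketch that builds the same argument list for one input file. The function name pair_args is made up, and the sketch assumes file names of the form NAME_1.fastq.gz:

```shell
# Sketch: derive the mate file and the output name from a read-1 file name.
pair_args() {
  r1=$1
  # like {/_1.fastq/_2.fastq}: replace _1.fastq with _2.fastq
  r2=$(printf '%s\n' "$r1" | sed 's/_1\.fastq/_2.fastq/')
  # like {/..}: basename without the last two extensions
  base=${r1##*/}; base=${base%.*}; base=${base%.*}
  printf '%s %s %s.out\n' "$r1" "$r2" "$base"
}

pair_args /some/dir/foo_1.fastq.gz
# /some/dir/foo_1.fastq.gz /some/dir/foo_2.fastq.gz foo_1.out
```

Running the parallel command with --dry-run first shows the real expansions without starting myprg.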