Article describing tool (for citations):
O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.
Author's website for obtaining code:
http://www.gnu.org/software/parallel/
All new computers have multiple cores. Many bioinformatics tools are serial in nature and will therefore not use the multiple cores. However, many bioinformatics tasks (especially within NGS) are extremely parallelizable:
- Run the same program on many files
- Run the same program on every sequence
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. GNU Parallel instead spawns a new job when one finishes, keeping the CPUs active and thus saving time.
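As an illustration (sleep stands in for real work; -j 4 caps GNU Parallel at 4 simultaneous jobs):

parallel -j 4 'sleep 1; echo job {} finished' ::: {1..32}

As soon as one job finishes, the next is started, so all 4 slots stay busy until the queue is empty.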
Installation
A personal installation does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
EXAMPLE: Replace a for-loop
It is often faster to write a command using GNU Parallel than to write a for loop:
for i in *gz; do
zcat $i > $(basename $i .gz).unpacked
done
can be written as:
parallel 'zcat {} > {.}.unpacked' ::: *.gz
The added benefit is that the zcats are run in parallel - one per CPU core.
EXAMPLE: Parallelizing BLAT
This will start one blat process per CPU core and distribute foo.fa to them in 1 MB blocks:
cat foo.fa | parallel --round-robin --pipe --recstart '>' 'blat -noHead genome.fa stdin >(cat) >&2' >foo.psl
EXAMPLE: Processing interleaved Fastq
FASTQ files have the format:
@M10991:61:000000000-A7EML:1:1101:14011:1001 1:N:0:28
CTCCTAGGTCGGCATGATGGGGGAAGGAGAGCATGGGAAGAAATGAGAGAGTAGCAAGG
+
#8BCCGGGGGFEFECFGGGGGGGGG@;FFGGGEG@FF<EE<@FFC,CEGCCGGFF<FGF
A record in interleaved FASTQ starts with a line like one of these:
@HWUSI-EAS100R:6:73:941:1973#0/1
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
@EAS139:136:FC706VJ:2:2104:15343:197393 1:N:18:1
where '/1' and ' 1:' mark this as read 1.
This will cut big.fq into one chunk per CPU core and pass it on stdin (standard input) to the program fastq-reader:
parallel --pipepart -a big.fq --block -1 --regexp \
--recend '\n' --recstart '@.*(/1| 1:.*)\n[A-Za-z\n\.~]' \
fastq-reader
EXAMPLE: Blast on multiple machines
Assume you have a 1 GB fasta file that you want to BLAST. GNU Parallel can then split the fasta file into 100 KB chunks and run one job per CPU core:
cat 1gb.fasta | parallel --block 100k --recstart '>' --pipe blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > results
If you have access to the local machine, server1 and server2, GNU Parallel can distribute the jobs to each of the servers. It will automatically detect how many CPU cores are on each of the servers:
cat 1gb.fasta | parallel -S :,server1,server2 --block 100k --recstart '>' --pipe blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > results
EXAMPLE: Run bigWigToWig for each chromosome
If you have one file per chromosome, it is easy to parallelize processing each file. Here we run bigWigToWig for chromosomes 1..19 plus X, Y, and M. These will run in parallel, but only one job per CPU core. The {} will be substituted with the arguments following the separator ':::'.
parallel bigWigToWig -chrom=chr{} wgEncodeCrgMapabilityAlign36mer_mm9.bigWig mm9_36mer_chr{}.map ::: {1..19} X Y M
EXAMPLE: Running composed commands
GNU Parallel is not limited to running a single command. It can run a composed command. Here is how you process multiple FASTA files using Biopieces (which uses pipes to communicate):
parallel 'read_fasta -i {} | extract_seq -l 5 | write_fasta -o {.}_trim.fna -x' ::: *.fna
See also: https://github.com/maasha/biopieces/wiki/HowTo#howto-use-biopieces-with-gnu-parallel
EXAMPLE: Running experiments
Experiments often have several parameters where every combination should be tested. Assume we have a program called experiment that takes 3 arguments: --age --sex --chr:
experiment --age 18 --sex M --chr 22
Now we want to run experiment for every combination of ages 1..80, sex M/F, and chr 1..22 + X Y:
parallel experiment --age {1} --sex {2} --chr {3} ::: {1..80} ::: M F ::: {1..22} X Y
To save the output in different files you could do:
parallel experiment --age {1} --sex {2} --chr {3} '>' output.{1}.{2}.{3} ::: {1..80} ::: M F ::: {1..22} X Y
But GNU Parallel can structure the output into directories so you avoid having thousands of output files in a single dir:
parallel --results outputdir experiment --age {1} --sex {2} --chr {3} ::: {1..80} ::: M F ::: {1..22} X Y
This will create files like outputdir/1/80/2/M/3/X/stdout containing the standard output of the job.
If you have many different parameters it may be handy to name them:
parallel --results outputdir --header : experiment --age {AGE} --sex {SEX} --chr {CHR} ::: AGE {1..80} ::: SEX M F ::: CHR {1..22} X Y
Then the output files will be named like outputdir/AGE/80/SEX/F/CHR/Y/stdout
If you want the output in a CSV/TSV file that you can read into R or LibreOffice Calc, simply point --results to a file name ending in .csv/.tsv:
parallel --results output.tsv --header : experiment --age {AGE} --sex {SEX} --chr {CHR} ::: AGE {1..80} ::: SEX M F ::: CHR {1..22} X Y
It will deal correctly with newlines in the output, so they will be read as newlines in R or LibreOffice Calc.
If one of your parameters takes on many different values, these can be read from a file using '::::':
echo AGE > age_file
seq 1 80 >> age_file
parallel --results outputdir --header : experiment --age {AGE} --sex {SEX} --chr {CHR} :::: age_file ::: SEX M F ::: CHR {1..22} X Y
If you have many experiments, it can be useful to see some experiments picked at random. Think of it as painting a picture by numbers: you can start from the top corner, or you can paint bits at random. If you paint bits at random, you will often see a pattern earlier than if you paint in a structured way.
With --shuf, GNU Parallel will shuffle the experiments and run them all, but in random order:
parallel --shuf --results outputdir --header : experiment --age {AGE} --sex {SEX} --chr {CHR} :::: age_file ::: SEX M F ::: CHR {1..22} X Y
EXAMPLE (advanced): Using GNU Parallel to parallelize your own scripts
Assume you have a BASH/Perl/Python script called launch. It takes one argument, ID:
launch ID
Using GNU Parallel you can run multiple IDs in parallel:
parallel launch ::: ID1 ID2 ...
But you would like to hide this complexity from the user, so the user only has to do:
launch ID1 ID2 ...
You can do that using --shebang-wrap. Change the shebang line from:
#!/usr/bin/env bash
#!/usr/bin/env perl
#!/usr/bin/env python
to:
#!/usr/bin/parallel --shebang-wrap bash
#!/usr/bin/parallel --shebang-wrap perl
#!/usr/bin/parallel --shebang-wrap python
You further develop your script so it now takes an ID and a DIR:
launch ID DIR
You would like it to take multiple IDs but only one DIR, and run the IDs in parallel. Again just change the shebang line to:
#!/usr/bin/parallel --shebang-wrap bash
And now you can run:
launch ID1 ID2 ID3 ::: DIR
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial once a year - your command line will love you for it: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel
#ilovefs
If you like GNU Parallel:
- Give a demo at your local user group/team/colleagues (remember to show them --bibtex)
- Post the intro videos on Reddit/Diaspora*/forums/blogs/ Identi.ca/Google+/Twitter/Facebook/Linkedin/mailing lists
- Get the merchandise https://www.gnu.org/s/parallel/merchandise.html
- Request or write a review for your favourite blog or magazine
- Request or build a package for your favourite distribution (if it is not already there)
- Invite me for your next conference
When using programs that use GNU Parallel to process data for publication you should cite as per parallel --citation. If you prefer not to cite, contact me.
If GNU Parallel saves you money:
- (Have your company) donate to FSF https://my.fsf.org/donate/
Excellent examples. I've been using GNU Parallel for a while, but I learned a lot by reading this. Thanks for posting, and for the videos (those really helped me get off the ground with Parallel).
This is very very useful. Thanks for the concrete examples. BTW, about zcat, a multithreaded version of gzip exists, it is called "pigz" ;-)
pigz (http://zlib.net/pigz/) is simple and can save a very significant amount of time if you have a lot of threads available.
First, GNU Parallel has the best installability of anything I have seen in my life. And all the examples worked - the gunzips, blats and blasts, etc.
However, I got stuck on one thing.
I have a simple Perl script called transparal.pl. It does something to a file that is provided as an argument, and the original one behaves as expected after chmod 700.
Then I changed the shebang, and ... Checking GNU Parallel itself looks ok. Confused.
Look at #!/usr/bin/parallel vs. where it really is: #!/usr/local/bin/parallel. You could also try #!/usr/bin/env parallel instead; then it will just take parallel from your PATH, which is possibly the (mostly) portable way. However, I am not sure the /usr/bin/env way handles parameters to the program. Edit: see http://stackoverflow.com/questions/4303128/how-to-use-multiple-arguments-with-a-shebang-i-e which in conclusion means that on many systems you can only use the correct absolute path (and '#!/usr/bin/env parallel' with arguments to the program most likely will not work).

I wonder if recent builds of the binary expect it to be in /usr/bin. Still trying to troubleshoot a similar problem.

I have built my own parallel and installed it in $HOME/bin, which is in my PATH; that worked fine for me.
Great. On a cluster, does one need to acquire/assign the cores to be used via MPI/SMP first, or can one just run parallel without that?

That would depend on the rules for your cluster. The default for GNU Parallel is to spawn one process per CPU core.
It's Rocks running Java SGE. I will test and see. Cheers
@ole.tange maybe you could briefly explain why parallel is superior to a for loop - aside from the shorter syntax.
The for loop executes the commands one at a time. Parallel can use multiple processors to run them in parallel.
A for loop can start many jobs simultaneously by putting them in the background - for i in *; do cat $i & done; - but that way you may start 1000 jobs, which is probably inefficient. Parallel does some clever load balancing.

What about dealing with the many bioinformatics tools that do not accept streams as input and insist on reading files instead (e.g. blat, I think)? Is there an easy way to autogenerate and delete such files within a single line of parallel?
You can use named pipes to stream data to placeholder files, which can be used with some tools that do not read streams: http://en.wikipedia.org/wiki/Named_pipe
Wow - very cool!
amazing, upvote
In these cases, what I do is write a wrapper script which generates any parameter file needed for running the script.
One solution is to create a file:

cat file | parallel --pipe "cat >{#}; my_program {#}; rm {#}"

Alex suggests using named pipes - which is more efficient, but does not work with every tool:

cat file | parallel --pipe "mkfifo {#}; my_program {#} & cat >{#}; rm {#}"
Hey Ole, how do I bypass awk quotes? Example (counting the reads in FASTQ files):

parallel 'echo {} && gunzip -c {} | wc -l | awk \'{print $1/4}\'' ::: *fastq.gz

won't work.

Hi Sukhdeep, the following worked for me:
parallel "echo {} && gunzip -c {} | wc -l | awk '{d=\$1; print d/4;}'" ::: *.gz
Super Alex, it works :)
In the past I've successfully convinced some tools (e.g. GATK HaplotypeCaller) to accept /dev/stdin and /dev/stdout as input and output files, respectively. Give it a try ;)

Use --fifo or --cat:
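For example (my_program is a stand-in for a tool that insists on reading a file name):

cat file | parallel --pipe --fifo my_program {}
cat file | parallel --pipe --cat my_program {}

--fifo passes a named pipe as {}; --cat writes each block to a temporary file, passes its name as {}, and removes it afterwards.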
How do we put a wait command between different parallel runs? I have a script that performs multiple jobs in order. Will that work?
Why do you need to call wait?
I am thinking that everything starts in parallel. I have to wait until job 1 finishes and then start job 2.
Use GNU make with option -j.
Maybe explore GNU Parallel's semaphore options. (Though a make-based process, as Pierre suggests, or a dedicated job scheduler is probably going to be easier to maintain.)
It will work, but there is really no need to call wait. GNU Parallel does that automatically. Try:
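(A stand-in example, with sleep simulating jobs of different lengths:)

parallel 'sleep {}; echo {} done' ::: 5 4 3 2 1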
You will see that GNU Parallel only finishes after the last job is done.
What is the best way to deal with the error below?
Alternatively, if somebody could show me how to pipe to a command defined in a bash script, that would be just wonderful. Right now, I'm doing:
and blastFunction begins like this:
Try giving parallel a text file which lists the input files, instead of direct arguments. You can do this via 4 colons (::::): first get a list of files, then pass that list with ::::.
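For example (a sketch assuming .txt query files and that blastFunction is exported from your script with export -f):

find . -name '*.txt' > files.txt
export -f blastFunction
parallel blastFunction :::: files.txt

find writes the list itself, so the shell never expands the glob into an over-long argument list.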
Thanks for the suggestion, but ...

Yes, if the argument list is too long for bash, it won't work with any command where you let the shell glob the file names. The example "EXAMPLE: convert all PED files in a directory to BED" should work for this using find; you seem to have too many txt files.
Thanks, this one worked.
Of course the optimal solution would be if I could skip the part of creating thousands and thousands of files...
You might try the >> operator next time you create files.

Use Bash's builtin printf. Because it is builtin, it is not subject to the same limit:
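For example (a sketch; the .txt glob and the file names carry over from the thread above):

printf '%s\n' *.txt > files.txt
parallel blastFunction :::: files.txt

The glob is expanded inside bash itself, with no execve of an external command, so the kernel's argument-length limit (ARG_MAX) never applies.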
Hi All, I need some help running my two samples through freebayes using parallel. I saw this in a previous post but got confused. I have two BAM files that need to be run with freebayes; in one case I don't want to use vcffirstheader and vt normalise, and in the second run I want to.

Let's say the files are S1.bam and S2.bam and the reference is hg38.fa. Also, do I need
?
With vcffirstheader and vt normalise, would it look like:
Can someone please help me?
Thanks
Hi, I have a burning question.
I want to run a script named "predict_binding.py". Its syntax is:
file.txt has a column of strings with the same length:
predict_binding.py works with the first 3 arguments and string_1, then the 3 arguments and string_2, and so on.
That's fine, but now I have m argBs, and I want to test all of them. I want to use the cluster for this, and it looks like a perfect job for parallel, doesn't it?
After reading the manual and spending hours trying to make it work, I realised I need some help.
What works so far (and is trivial) is:
This gives the same result as:
And indeed the flag --verbose says that the command looks like
./predict_binding.py argA argBi argC ./file.txt
but I want to test all the argBs, so I made a file called args.txt, which looks like this:
If I do:
I get an error from ./predict_binding saying:
predict_binding.py: error: incorrect number of arguments
And verbose says that the command looks like:
./predict_binding.py argA\ argBi\ argC\ ./file.txt
So, maybe those backslashes are affecting the input of ./predict_binding? How could I avoid them?
I have tried using double and single quotes " ', backslash \, and backslash with single quote \'; none of it worked!
I also tried:
Same error as above.
And also I tried to use a function like:
Interestingly, binding_func works for:
But if I do:
It gives the result for one arg but fails (same error as above) for the other.
If I put only argB1 in the args.txt file and do:
It fails miserably with the same error:
predict_binding.py: error: incorrect number of arguments
It seems a very trivial and easy problem but I haven't been able to solve it }:(
I would appreciate very much any help provided. :)
parallel ./predict_binding.py argA argB argC :::: ./file.txt
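The backslashes in the --verbose output appear because GNU Parallel quotes each input line as one single argument. If each line of args.txt holds three space-separated values (an assumption about its format), --colsep splits each line into separate replacement strings:

parallel --colsep ' ' ./predict_binding.py {1} {2} {3} ./file.txt :::: args.txt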
This is just wonderful. As you have mentioned above, GNU Parallel can parallelize your own scripts (bash/python/perl etc.) so they take multiple IDs (i.e., arguments) in a single go. Does it work the other way too - taking a single argument and running it on multiple cores of the computer?
How would you run a single argument on multiple cores?
What options are available if you want to utilize other machines but they require a password for ssh? Is there a way to force using rsh?
define a key with ssh-keygen ? http://rcsg-gsir.imsb-dsgi.nrc-cnrc.gc.ca/documents/internet/node31.html
Set up RSA keys for password-less SSH.
Have a look at: https://wiki.dna.ku.dk/dokuwiki/doku.php?id=ssh_config
Thank you so much!!! This turned a 5.5 hour blast+ job into 25 minutes!
Hello ole.tange
In the case of blast, I was wondering what the difference is between using -num_threads and using parallel: when I use parallel and run top, it shows many blast processes each at 99-100% CPU, while with -num_threads it shows only one blast process at 5900% CPU. (I have 60 cores in the server.)
I am having trouble comprehending the two ideas!
You should use whichever works faster for you.
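For concreteness, the two models look roughly like this (db.fa and query.fa are placeholders; chunk size as in the BLAST example above):

blastp -num_threads 60 -db db.fa -query query.fa > results
cat query.fa | parallel --block 100k --recstart '>' --pipe blastp -num_threads 1 -db db.fa -query - > results

The first is one process with 60 threads (one blast entry in top at ~6000% CPU); the second is 60 single-threaded processes, each on its own chunk of the query (many blast entries at ~100% each).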
I am creating a single .fastq.gz file from many .fastq.gz files with the following command:
zcat 15_S15*.fastq.gz | gzip -c > combined_file.fastq.gz
Now I want to do it with a parallel command.
Can anyone help me?
Furthermore: you don't need zcat | gzip; see How To Merge Two Fastq.Gz Files?

Please ask this as a new question.
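(For reference, the point of the zcat | gzip remark: gzip streams can simply be concatenated, so

cat 15_S15*.fastq.gz > combined_file.fastq.gz

already produces a valid combined .fastq.gz without decompressing and recompressing.)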