All new computers have multiple cores. Many bioinformatics tools are serial in nature and will therefore not use the multiple cores. However, many bioinformatics tasks (especially within NGS) are extremely parallelizable:
- Run the same program on many files
- Run the same program on every sequence
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU.
GNU Parallel instead spawns a new process when one finishes - keeping the CPUs active and thus saving time.
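The difference can be illustrated with a small bash simulation. This is only a sketch, not how GNU Parallel is implemented: the job durations are made up, and the example is scaled down to 8 jobs on 2 CPUs. It compares the total run time of a fixed up-front split with starting each job on whichever CPU becomes free first.

```shell
# Hypothetical job durations in seconds (made up for illustration).
durations=(5 5 5 5 1 1 1 1)
cpus=2

# Static split: each CPU gets a fixed block of consecutive jobs up front.
per_cpu=$(( ${#durations[@]} / cpus ))
static_time=0
for (( c = 0; c < cpus; c++ )); do
  total=0
  for (( j = c * per_cpu; j < (c + 1) * per_cpu; j++ )); do
    total=$(( total + durations[j] ))
  done
  if (( total > static_time )); then static_time=$total; fi
done

# Dynamic: the next job starts on whichever CPU becomes free first.
busy_until=()
for (( c = 0; c < cpus; c++ )); do busy_until[c]=0; done
for d in "${durations[@]}"; do
  first=0
  for (( c = 1; c < cpus; c++ )); do
    if (( busy_until[c] < busy_until[first] )); then first=$c; fi
  done
  busy_until[first]=$(( busy_until[first] + d ))
done
dynamic_time=0
for t in "${busy_until[@]}"; do
  if (( t > dynamic_time )); then dynamic_time=$t; fi
done

echo "static split: ${static_time}s, dynamic: ${dynamic_time}s"
# prints "static split: 20s, dynamic: 12s"
```

With these durations the static split leaves one CPU idle for most of the run. How much is gained depends on how uneven the job lengths are; if all jobs take equally long, the two approaches finish at about the same time.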
EXAMPLE: Replace a for-loop
It is often faster to write a command using GNU Parallel than making a for loop:
for i in *.gz; do
  zcat "$i" > "$(basename "$i" .gz).unpacked"
done
can be written as:
parallel 'zcat {} > {.}.unpacked' ::: *.gz
The added benefit is that the zcats are run in parallel - one per CPU core.
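For readers new to the replacement strings, the following bash loop spells out what `{}` and `{.}` stand for. It is an illustration in plain bash rather than GNU Parallel itself, and the file names are hypothetical.

```shell
# Hypothetical input names; in the example above they come from *.gz.
cmds=()
for f in reads_1.fastq.gz reads_2.fastq.gz; do
  whole="$f"        # {}  : the input as given
  stem="${f%.*}"    # {.} : the input with its last extension removed
  cmds+=("zcat $whole > $stem.unpacked")
done
printf '%s\n' "${cmds[@]}"
# prints:
# zcat reads_1.fastq.gz > reads_1.fastq.unpacked
# zcat reads_2.fastq.gz > reads_2.fastq.unpacked
```

Unlike `basename`, `{.}` keeps any directory part of the input. If the inputs contain directories and you want them dropped as well, `{/.}` removes both the directory and the extension.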
EXAMPLE: Blast on multiple machines
Assume you have a 1 GB fasta file that you want to blast. GNU Parallel can then split the fasta file into 100 KB chunks and run 1 job per CPU core:
cat 1gb.fasta | parallel --block 100k --recstart '>' --pipe blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > results
If you have access to the local machine, server1 and server2, GNU Parallel can distribute the jobs to each of the servers. It will automatically detect how many CPU cores are on each of the servers:
cat 1gb.fasta | parallel -S :,server1,server2 --block 100k --recstart '>' --pipe blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > result
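The role of `--recstart '>'` is to make sure a chunk never ends in the middle of a sequence. The awk sketch below mimics that idea on a tiny made-up FASTA file with a 25-byte block size: records are grouped into chunks that stay within the block size where possible, and cuts only happen in front of a `>` line. It is a simplified illustration under those assumptions, not GNU Parallel's actual splitting code.

```shell
# Tiny made-up FASTA file with five records.
fasta=$(mktemp)
printf '>seq1\nMKV\n>seq2\nMLAAG\n>seq3\nMT\n>seq4\nMPQRS\n>seq5\nMA\n' > "$fasta"

chunk_summary=$(awk -v block=25 '
  # Add the finished record to the current chunk, starting a new chunk
  # first if the record would push the chunk past the block size.
  function close_record() {
    if (rec_bytes == 0) return
    if (chunk_bytes > 0 && chunk_bytes + rec_bytes > block) emit_chunk()
    chunk_bytes += rec_bytes
    chunk_records++
    rec_bytes = 0
  }
  function emit_chunk() {
    printf "chunk %d: %d record(s), %d bytes\n", ++chunk_no, chunk_records, chunk_bytes
    chunk_bytes = 0
    chunk_records = 0
  }
  /^>/ { close_record() }              # a new record starts here
  { rec_bytes += length($0) + 1 }      # +1 for the newline
  END { close_record(); if (chunk_records > 0) emit_chunk() }
' "$fasta")

echo "$chunk_summary"
# prints:
# chunk 1: 2 record(s), 22 bytes
# chunk 2: 2 record(s), 21 bytes
# chunk 3: 1 record(s), 9 bytes
rm -f "$fasta"
```

Every chunk starts with a header line and contains only whole records, which is what blastp needs to read each chunk as a valid fasta file on its standard input.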
EXAMPLE: Running experiments
Experiments often have several parameters where every combination should be tested. Assume we have a program called experiment that takes 3 arguments: --age --sex --chr:
experiment --age 18 --sex M --chr 22
Now we want to run experiment for every combination of ages 1..80, sex M/F, chr 1..22+XY:
parallel experiment --age {1} --sex {2} --chr {3} ::: {1..80} ::: M F ::: {1..22} X Y
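It can be worth checking how many jobs such a command generates before starting it. Note that `{1..80}` and `{1..22}` are expanded by bash before GNU Parallel sees them, so the same expansion can be used to count the combinations. A quick check in bash (the `:` separator is only there for display):

```shell
# Build every age:sex:chr combination with bash brace expansion.
combos=( {1..80}:{M,F}:{{1..22},X,Y} )

echo "${#combos[@]} combinations"          # 80 * 2 * 24 = 3840
echo "first: ${combos[0]}"                 # 1:M:1
echo "last:  ${combos[${#combos[@]}-1]}"   # 80:F:Y
```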
To save the output in different files you could do:
parallel experiment --age {1} --sex {2} --chr {3} '>' output.{1}.{2}.{3} ::: {1..80} ::: M F ::: {1..22} X Y
But GNU Parallel can structure the output into directories so you avoid having thousands of output files in a single dir:
parallel --results outputdir experiment --age {1} --sex {2} --chr {3} ::: {1..80} ::: M F ::: {1..22} X Y
This will create files like outputdir/1/80/2/M/3/X/stdout containing the standard output of the job.
If you have many different parameters it may be handy to name them:
parallel --results outputdir --header : experiment --age {AGE} --sex {SEX} --chr {CHR} ::: AGE {1..80} ::: SEX M F ::: CHR {1..22} X Y
Then the output files will be named like outputdir/AGE/80/CHR/Y/SEX/F/stdout
If one of your parameters takes on many different values, these can be read from a file using '::::':
echo AGE > age_file
seq 1 80 >> age_file
parallel --results outputdir --header : experiment --age {AGE} --sex {SEX} --chr {CHR} :::: age_file ::: SEX M F ::: CHR {1..22} X Y
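A self-contained way to try this out is sketched below. It writes the header and values to a temporary file and, if GNU Parallel is available, uses `--dry-run` to print the generated commands instead of running them, so the `experiment` program does not need to exist. The check against `parallel --version` is an added precaution, since some systems ship a different tool under the same name. The `--chr` parameter is left out to keep the output short.

```shell
# Write the column header followed by the values 1..80.
age_file=$(mktemp)
{ echo AGE; seq 1 80; } > "$age_file"

header=$(head -n 1 "$age_file")
values=$(( $(wc -l < "$age_file") - 1 ))
echo "header: $header, values: $values"

# Only call parallel if the installed command really is GNU Parallel.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
  # --dry-run prints each command line instead of executing it;
  # 80 ages x 2 sexes should give 160 lines.
  parallel --dry-run --header : experiment --age {AGE} --sex {SEX} \
    :::: "$age_file" ::: SEX M F | wc -l
fi

rm -f "$age_file"
```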
Learn more
See more examples:
- Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them
- http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html Your command line will love you for it.
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel
I may have run into a bug with the current version:
This is being done on an Ubuntu 12.04 host, running GCC 4.8.1:
Why is bash looking in /usr/bin when you installed in /usr/local/bin? Try:
And try:
I'm not sure why bash is looking in /usr/bin, but I'm not sure if setting up a symbolic link is the right solution. I'll try compiling an older version some time next week.