Hi,
I am trying to parallelise a loop in my bash script so I can index multiple genomes at once rather than sequentially, but I'm really struggling to get it working. I also want to learn how to do this as there are many other parts of the pipeline I would like to parallelise.
I am submitting the job to an HPC (Linux environment) that uses the SLURM workload manager. My script is as follows:
#!/bin/bash
#SBATCH --job-name=parallel_indexing
#SBATCH --output=parallel_indexing%j.log
# send email when the job begins, ends, or aborts
#SBATCH --mail-user=<email> --mail-type=BEGIN,END,ABORT
#request resources
#SBATCH --ntasks=5
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=500G
#SBATCH --time=12:00:00
#SBATCH --partition=k2-medpri
# set working directory to scratch space project folder
#SBATCH --chdir <dir>
#load modules
module load <bowtie>
#initiate arrays
GENOMES=(2015_genome.fa 2021_genome.fa liv_genome.fa wash_genome.fa human_genome.fa)
NAMES=(2015_genome 2021_genome liv_genome wash_genome human_genome)
for index in ${!GENOMES[*]}; do
    bowtie-build ${GENOMES[$index]} ${NAMES[$index]} &
done
The job runs for about 4 seconds, then exits without error, and the log file is completely empty. If anyone has any advice, it would be greatly appreciated!
Use a workflow manager like Snakemake or Nextflow.
An easy option would be to submit several jobs. Create several scripts, one per genome, and submit them to the cluster.
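For example (a minimal sketch, assuming a hypothetical per-genome SBATCH script called index_one.sh that takes the genome FASTA as its first argument), the submissions could be driven by a small loop:

#!/bin/bash
# submit one indexing job per genome;
# index_one.sh is a hypothetical SBATCH script that runs
# bowtie-build on the FASTA passed to it as $1
for genome in 2015_genome.fa 2021_genome.fa liv_genome.fa wash_genome.fa human_genome.fa; do
    sbatch index_one.sh "$genome"
done

Arguments given after the script name on the sbatch command line are passed straight through to the script.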
Hi, thank you for the reply. This would work, of course, but this is just one part of a larger pipeline, and for the rest it wouldn't be practical (many more files than just the 5 here), so I'm really keen to understand how to parallelise if possible!
The simple answer is: don't use a loop, use GNU parallel.
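For example, a minimal sketch of the indexing step with GNU parallel inside your existing SLURM script (assuming the same GENOMES and NAMES arrays, and that GNU parallel is available on the cluster, possibly via a module) could look like this:

module load <bowtie>
# module load parallel   # only if GNU parallel is not already on the PATH (assumption)

GENOMES=(2015_genome.fa 2021_genome.fa liv_genome.fa wash_genome.fa human_genome.fa)
NAMES=(2015_genome 2021_genome liv_genome wash_genome human_genome)

# --link pairs the two input lists element by element,
# and -j 5 runs at most five bowtie-build commands at once
parallel --link -j 5 bowtie-build {1} {2} ::: "${GENOMES[@]}" ::: "${NAMES[@]}"

Unlike backgrounding with &, GNU parallel waits until all the commands it launched have finished before returning, so the batch script won't exit while bowtie-build is still running.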