I've seen posts here that parallelize bwa-mem, but all the examples used paired-end data and the code was standalone (i.e. they only needed it for bwa mem and maybe converting SAM to BAM). In my case I also have to run my script for 1000s of different reference files, and the script has many steps. My large bash script, which I run on SLURM, processes 40 single-end fastq files sequentially, with various conversions and other scripts called along the way, so I want to implement GNU parallel. Would I need to parallelize each command block? Here is my attempt at putting my first few commands into a function and running parallel on it, but I'm getting confused about how to actually write out the parallel call:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=96
#SBATCH --time=05:00:00
cd $SLURM_SUBMIT_DIR
module load gcc
module load samtools
module load bwa
module load gnu-parallel # new module I want to use to speed up my jobs
REF=$1 # fasta reference file; I have 1000s for various tests, so the IDs are saved in reflist.txt
GENE=$(basename "$REF" .fa) # reference name without path or extension
path=$PWD
function BWAMEM {
    # First, index the reference file (skip if the index already exists)
    if [ -f "$REF.bwt" ]; then
        echo "$REF already indexed, skipping"
    else
        echo "Creating index for $REF"
        bwa index "$REF"
    fi
    # Run BWA MEM for each fastq file in the cleaned directory
    for files in ../cleaned/*_cleaned.fastq
    do
        if [ -f "$(basename "${files%%_cleaned.fastq}_${GENE}_aln.sam")" ]
        then
            echo "BWA MEM already done for $files, skipping"
        else
            echo "Running BWA MEM for $files"
            bwa mem -t 20 "$REF" "$files" > "${files%%_cleaned.fastq}_${GENE}_aln.sam" && \
            mv "${files%%_cleaned.fastq}_${GENE}_aln.sam" "$path"
            sleep 10
        fi
    done
}; export -f BWAMEM
parallel -j 20 BWAMEM ### can I leave it like this? Does parallel know to access the cleaned fastq files?
# next set of tasks
# ...
# final set of tasks
I have stopped here because I'm unsure whether I even need to keep the for loop in my BWAMEM function, since I need to supply ::: something to parallel. Should I instead get rid of the for loop and write:
parallel -j 20 BWAMEM ::: ../cleaned/*_cleaned.fastq
If I have 40 fastq files, then this parallel command should run 20 of them at a time while BWA MEM itself uses 20 cores (for a total of 40 cores, which is what the server allows), and then the rest of my script would run, which is mostly conversions and cleanup of the BAM files that get generated. I was also wondering whether those conversion steps (some take a long time too) need to be parallelized as well, for example by folding them into the same per-file block, roughly as sketched below.
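I was imagining something like the snippet below, where the samtools calls just stand in for my real conversion/cleanup steps and the names are only illustrative:

# rough sketch: align one fastq and pipe straight into a sorted, indexed BAM within the same job,
# so the conversion is parallelized along with the alignment
bam=$(basename "${files%%_cleaned.fastq}")_${GENE}_aln.bam
bwa mem -t 20 "$REF" "$files" | samtools sort -o "$bam" -
samtools index "$bam"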
OR... am I supposed to supply the list of REFERENCE files to parallel instead, since I do want to run this for 1000s of reference fasta files for various tests/tasks?
So maybe I should keep the for loop that goes through all the cleaned fastq files, and the final parallel command would be:
parallel -j 20 BWAMEM ::: $(sed -e 's/\.fa//g' reflist.txt)
Your script requires improvement: the idea of parallel is to replace the for loop entirely. List all relevant fastq files first and then pipe that list into a parallel command that runs the bwa operations on them in parallel. Right now you are calling the files both from the parallel command and via the for loop, which might conflict. See the linked thread for inspiration. By the way, if you are building from scratch, consider learning how to do these kinds of workflows with a manager such as snakemake.
A: Optimize bwa mem in parallel on a SLURM cluster
I know, but because I have multiple fastq files to process for 1000s of ref genes (a situation the link you provided, and others I've found here, don't cover), I am unsure which part of my script to change. Do I use parallel to call the individual REF files, of which I have thousands, and in that case leave the for loop to go through the 40 fastq files... OR do I use parallel to run 20 fastq files at a time, but then somehow execute it 1000s of times, once per REF file? To me the former makes more sense, solely because there are 1000s of ref files for testing, but I've only seen parallel bwa mem examples where they feed in the fastq files. Hence my confusion :(
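Put differently, is the end goal something like the line below, where every reference/fastq pair becomes one job? (BWAMEM_PAIR is just a made-up function that would take a reference and a fastq as its two arguments.)

# hypothetical: :::: reads reference IDs line by line from reflist.txt, ::: supplies the fastqs,
# and parallel runs the function once for every (reference, fastq) combination
parallel -j 10 BWAMEM_PAIR {1} {2} :::: reflist.txt ::: ../cleaned/*_cleaned.fastq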