I've seen posts here that parallelize bwa-mem, but all the examples used paired-end data and the code was standalone (i.e. they only needed it for bwa mem and maybe converting SAM to BAM). In my case I also have to run my script for 1000s of different reference files, and the script has many steps. My large bash script, which I run on SLURM, processes 40 single-end fastq files sequentially, with various conversions and other scripts called along the way, so I want to implement GNU parallel. Would I need to parallelize each command block? Here is my attempt at putting my first few commands into a function and running parallel on it, but I'm getting confused about how to actually write out the parallel call:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=96
#SBATCH --time=05:00:00
cd $SLURM_SUBMIT_DIR
module load gcc
module load samtools
module load bwa
module load gnu-parallel # new module I want to use to speed up my jobs
REF=$1 # fasta reference file; I have 1000s for various tests, so the IDs are saved in reflist.txt
GENE=$(basename "$REF" .fa) # reference name without path or extension
path=$PWD
function BWAMEM {
    # First, index the reference file (skip if the index already exists)
    if [ -f "$REF.bwt" ]; then
        echo "$REF already indexed, skipping"
    else
        echo "Creating index for $REF"
        bwa index "$REF"
    fi
    # Run BWA MEM for each fastq file in the cleaned directory
    for files in ../cleaned/*_cleaned.fastq
    do
        if [ -f "$(basename "${files%%_cleaned.fastq}_${GENE}_aln.sam")" ]
        then
            echo "BWA MEM already done for $files, skipping"
        else
            echo "Running BWA MEM for $files"
            bwa mem -t 20 "$REF" "$files" > "${files%%_cleaned.fastq}_${GENE}_aln.sam" && \
            mv "${files%%_cleaned.fastq}_${GENE}_aln.sam" "$path"
            sleep 10
        fi
    done
}; export -f BWAMEM
parallel -j 20 BWAMEM ### can I leave it like this? Does parallel know to access the cleaned fastq files?
# next set of tasks
# ...
# final set of tasks
I have stopped here because I'm unsure whether I even need to keep the for loop in my BWAMEM function, since I need to supply ::: something to parallel. Should I instead get rid of the for loop and write:
parallel -j 20 BWAMEM ::: ../cleaned/*_cleaned.fastq
If I have 40 fastq files, then this parallel command should run 20 of them at a time while BWA MEM itself uses 20 cores (for a total of 40 cores, which is what the server allows), and then the rest of my script would run, which is mostly conversions and cleanup of the BAM files that get generated. I was also wondering whether those conversion steps (some take a long time too) need to be parallelized as well, for example by folding them into the same per-file block, roughly as sketched below.
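I was imagining something like the snippet below, where the samtools calls just stand in for my real conversion/cleanup steps and the names are only illustrative:

# rough sketch: align one fastq and pipe straight into a sorted, indexed BAM within the same job,
# so the conversion is parallelized along with the alignment
bam=$(basename "${files%%_cleaned.fastq}")_${GENE}_aln.bam
bwa mem -t 20 "$REF" "$files" | samtools sort -o "$bam" -
samtools index "$bam"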
OR... am I supposed to supply the list of REFERENCE files to parallel instead, since I do want to run this for 1000s of reference fasta files for various tests/tasks?
So maybe I should keep the for loop that goes through all the cleaned fastq files, and the final parallel command would be:
parallel -j 20 BWAMEM ::: $(sed -e 's/\.fa//g' reflist.txt)
Your script requires improvement: the idea of parallel is to replace the for loop entirely. List all relevant fastq files first and then pipe that list into a parallel command that runs the bwa operations on them in parallel. Right now you are calling the files both from the parallel command and via the for loop, which might conflict. See the linked thread for inspiration. By the way, if you are building from scratch, consider learning how to do these kinds of workflows with a manager such as snakemake.
A: Optimize bwa mem in parallel on a SLURM cluster
I know, but because I have multiple fastq files to process for 1000s of ref genes (a situation the link you provided, and others I've found here, don't cover), I am unsure which part of my script to change. Do I use parallel to call the individual REF files, of which I have thousands, and in that case leave the for loop to go through the 40 fastq files... OR do I use parallel to run 20 fastq files at a time, but then somehow execute it 1000s of times, once per REF file? To me the former makes more sense, solely because there are 1000s of ref files for testing, but I've only seen parallel bwa mem examples where they feed in the fastq files. Hence my confusion :(
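Put differently, is the end goal something like the line below, where every reference/fastq pair becomes one job? (BWAMEM_PAIR is just a made-up function that would take a reference and a fastq as its two arguments.)

# hypothetical: :::: reads reference IDs line by line from reflist.txt, ::: supplies the fastqs,
# and parallel runs the function once for every (reference, fastq) combination
parallel -j 10 BWAMEM_PAIR {1} {2} :::: reflist.txt ::: ../cleaned/*_cleaned.fastq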