I'm working on a SLURM cluster with NGS data. I've trimmed the raw reads and am now deciding on the best way to align them to the reference genome. I have paired-end reads for a few samples, and I wrote a script to run bwa in parallel:
#!/bin/bash
#SBATCH --cpus-per-task=1
#SBATCH --ntasks=10
#SBATCH --nodes=1
# align with bwa & convert to bam
bwatosam() {
    id=$1        # sample ID
    index=$2     # reference index
    output=$3/"$id".bam
    fq1=$4/"$id".R1.fq.gz
    fq2=$4/"$id".R2.fq.gz
    # double quotes (not single) so $id expands inside the read-group string
    bwa mem -t 16 -R "@RG\tID:${id}\tSM:${id}\tPL:ILLUMINA\tLB:${id}_exome" -v 3 -M "$index" "$fq1" "$fq2" |
        samtools view -b -o "$output" -
}
export -f bwatosam
# run bwatosam in parallel
ls trimmed/*.R1.fq.gz |
xargs -n 1 basename |
awk -F ".R1" '{print $1 | "sort -u"}' |
parallel -j $SLURM_NTASKS "bwatosam {} index.fa alns trimmed"
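For reference, the ls/xargs/awk pipeline above reduces each pair of read files to a bare sample ID, so parallel gets one ID per job (sample names here are made up):

trimmed/sampleA.R1.fq.gz, trimmed/sampleA.R2.fq.gz  ->  sampleA
trimmed/sampleB.R1.fq.gz, trimmed/sampleB.R2.fq.gz  ->  sampleB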
But I'm not sure I'm using the right #SBATCH parameters for the job, because when I run it without -j:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=5
# run bwatosam in parallel
ls trimmed/*.R1.fq.gz |
xargs -n 1 basename |
awk -F ".R1" '{print $1 | "sort -u"}' |
parallel "bwatosam {} index.fa alns trimmed"
It runs 10 times faster. What number of nodes/CPUs/threads should I use?
Have you tried submitting jobs directly to SLURM, without the additional complexity of parallel? On a cluster, using parallel just adds complexity for no good reason, as far as I can see.
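If it helps, here is a minimal sketch of that approach as a SLURM job array, assuming the same trimmed/ layout, index.fa, and alns/ output directory as in the question (the array bounds are illustrative and must match the real sample count):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --array=0-9    # one array task per sample; adjust to the real sample count

# Build the sorted list of sample IDs and pick the one for this array task
ids=($(ls trimmed/*.R1.fq.gz | xargs -n 1 basename | awk -F ".R1" '{print $1}' | sort -u))
id=${ids[$SLURM_ARRAY_TASK_ID]}

# Align one sample using exactly the CPUs allocated to this task
bwa mem -t "$SLURM_CPUS_PER_TASK" \
    -R "@RG\tID:${id}\tSM:${id}\tPL:ILLUMINA\tLB:${id}_exome" \
    -v 3 -M index.fa "trimmed/${id}.R1.fq.gz" "trimmed/${id}.R2.fq.gz" |
    samtools view -b -o "alns/${id}.bam" -

Each sample then runs as an independent job, bwa's -t always matches --cpus-per-task, and SLURM does the scheduling that parallel was doing.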