The most important thing is to stop using ls or bash pattern matching in loops. Those constructs are error-prone, and on top of that you constantly have to work backwards from the full file names to generate simpler names.
Instead, collect the so-called root names of the study into a file. For example, your samples.txt could contain:
sample1
sample2
sample3
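As a minimal sketch, the file can be written once by hand (here with printf) and then used to confirm that the inputs exist before anything is run. The _R1_001.fq and _R2_001.fq suffixes are assumed from the commands in this answer; adjust them to your own naming scheme.

```shell
# Write the root names, one per line (names taken from the example above).
printf '%s\n' sample1 sample2 sample3 > samples.txt

# Report any root whose paired read files are missing.
# The suffixes are an assumption based on the file names used in this answer.
while read -r root; do
    for suffix in _R1_001.fq _R2_001.fq; do
        [ -e "${root}${suffix}" ] || echo "missing: ${root}${suffix}"
    done
done < samples.txt
```

If nothing is printed, every root has both read files in the current directory.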
Use this samples.txt file to build and run commands via GNU Parallel.
Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them
Now, all of a sudden, instead of bending over backwards to generate the right names, you can simply use the root to construct each new name, like so:
cat samples.txt | parallel "bwa mem ... genome.fa \
{}_R1_001.fq {}_R2_001.fq | samtools sort > {}.bam"
See how much simpler it is? To see what commands this generates without running them, put an echo in front and escape the bash pipe and redirect:
cat samples.txt | parallel echo "bwa mem -R '@RG\tID:{}\tSM:{}\tPL:ILLUMINA' \
genome.fa {}_R1_001.fq {}_R2_001.fq \| samtools sort \> {}.bam"
Now you can see the commands that are about to execute:
bwa mem -R @RG\tID:sample1\tSM:sample1\tPL:ILLUMINA genome.fa sample1_R1_001.fq sample1_R2_001.fq | samtools sort > sample1.bam
bwa mem -R @RG\tID:sample2\tSM:sample2\tPL:ILLUMINA genome.fa sample2_R1_001.fq sample2_R2_001.fq | samtools sort > sample2.bam
bwa mem -R @RG\tID:sample3\tSM:sample3\tPL:ILLUMINA genome.fa sample3_R1_001.fq sample3_R2_001.fq | samtools sort > sample3.bam
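You can run the first command alone to check that it does what you think it should. Note that echo drops the single quotes around the -R argument, so put them back if you paste a printed command. If it does the right thing, remove the echo and the escapes and run everything in parallel.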
Plus, GNU Parallel can use multiple CPUs and actually run things in parallel.
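As an alternative to the echo-and-escape preview, GNU Parallel has a --dry-run option that prints each command instead of running it, and a -j option that sets how many jobs run at once. The sketch below combines the two; the commands.txt file name and the job count of 2 are arbitrary choices for illustration, not part of the original workflow.

```shell
# Root names, one per line, as in the example above.
printf '%s\n' sample1 sample2 sample3 > samples.txt

# --dry-run prints each command instead of running it, so the pipe and the
# redirect need no escaping. -k keeps the output in input order.
# -j 2 limits the run to two simultaneous jobs (an arbitrary choice here);
# without -j, GNU Parallel starts one job per CPU.
parallel --dry-run -k -j 2 \
    "bwa mem -R '@RG\tID:{}\tSM:{}\tPL:ILLUMINA' genome.fa {}_R1_001.fq {}_R2_001.fq | samtools sort > {}.bam" \
    :::: samples.txt > commands.txt

# Inspect the first generated command before removing --dry-run.
head -n 1 commands.txt
```

Once the printed commands look right, dropping --dry-run (and the redirect to commands.txt) runs the real alignments with the same job limit.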