Hello,
I would like to understand the best way to run bwa in parallel on a SLURM cluster. Obviously, this will depend on the computational limits I have as a user.
The bwa software has an argument "-t" that specifies the number of threads. Let's imagine that I run:
bwa mem -t 3 ref.fa sampleA.fq.gz
This means that bwa splits the job across three threads; in other words, it will align three reads at a time in parallel (I guess).
Now, if I want to run this command on several samples on a SLURM cluster, should I set the number of tasks to match bwa mem's threads and also specify the number of CPUs per task (for instance 2)? That would be:
sbatch -c 2 -n 3 bwa.sh
where bwa.sh contains:
cat data.info | while read indv; do
    bwa mem -t 3 ref.fa sample${indv}.fq.gz
done
Do you have any suggestions? Or can you improve/correct my reasoning?
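For what it's worth, on the SLURM side -n (--ntasks) counts separate tasks that you would launch yourself (e.g. with srun or MPI), while -c (--cpus-per-task) is what should match bwa's -t, since bwa mem is a single multithreaded process. Under that assumption, a minimal sketch of the loop-in-one-job idea (file names taken from the question; resource values are only placeholders) would be:

#!/bin/bash
#SBATCH --ntasks=1          # the script is a single task
#SBATCH --cpus-per-task=3   # matches bwa mem -t 3
#SBATCH --mem=16G           # placeholder; size it to your genome

# Align each sample listed in data.info, one after another
while read -r indv; do
    bwa mem -t 3 ref.fa "sample${indv}.fq.gz" > "sample${indv}.sam"
done < data.info

Note that this still processes samples sequentially within one job; running samples in parallel across the cluster is what the job-array approach discussed below is for.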
Nicolas Rosewick : It would be useful to show a small example of what a sample_sheet.txt file should look like (for PE data). Also, for human-sized data 10 GB of RAM is not sufficient, so it would be best to make a note of that.

Indeed, 10 GB is maybe too small; it was just for the example ;) I have also added an example of the sample sheet.
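For PE data, the sheet can be as simple as one sample per line with the two FASTQ paths, tab-separated. The sheet actually added to the answer is not reproduced here; the names below are invented purely for illustration:

sampleA	sampleA_R1.fq.gz	sampleA_R2.fq.gz
sampleB	sampleB_R1.fq.gz	sampleB_R2.fq.gz
sampleC	sampleC_R1.fq.gz	sampleC_R2.fq.gz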
OK, this is a good solution. But imagine that the script also contains a command to index the genome. Could I launch the array from inside the script, rather than from outside as you are suggesting? I hope my question is clear.
You need to index the genome only once, so that can be a separate one-time job.
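For example, indexing can live in its own tiny one-time job, something along these lines (script name and resource values are illustrative; bwa index is single-threaded):

#!/bin/bash
#SBATCH --job-name=bwa_index
#SBATCH --cpus-per-task=1
#SBATCH --mem=8G            # placeholder; a human genome needs a few GB

bwa index ref.fa

Submit it once, and only start the alignment array after it has finished (for instance with sbatch --dependency=afterok:<jobid>).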
Thank you. However, it is still unclear to me where you use the information in sample_sheet.txt. In addition, are $SLURM_ARRAY_TASK_ID and the file names all variables that you define outside the script?
$SLURM_ARRAY_TASK_ID is a variable that reports the task's index within the job array. Check here for information on job arrays: https://slurm.schedmd.com/job_array.html
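To make the link with the sample sheet concrete, inside an array job each task can use its own $SLURM_ARRAY_TASK_ID to pull "its" line from sample_sheet.txt. A sketch, assuming the three-column layout shown earlier (sample name, R1, R2):

#!/bin/bash
#SBATCH --cpus-per-task=3
#SBATCH --mem=16G
# Submit with: sbatch --array=1-3 bwa_array.sh   (one index per line of the sheet)

# Pick the line of sample_sheet.txt that matches this array task
line=$(sed -n "${SLURM_ARRAY_TASK_ID}p" sample_sheet.txt)
sample=$(echo "$line" | cut -f1)
r1=$(echo "$line" | cut -f2)
r2=$(echo "$line" | cut -f3)

bwa mem -t "$SLURM_CPUS_PER_TASK" ref.fa "$r1" "$r2" > "${sample}.sam"

Nothing here is defined outside the script: SLURM itself sets $SLURM_ARRAY_TASK_ID (and $SLURM_CPUS_PER_TASK) for each task when the array runs.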
OK, this is clear. But in your script, you do not seem to use the "samplesheet" variable. Or maybe I did not understand.
Indeed, my bad: a copy/paste error. I have fixed it now.