Question

Submitting a job in parallel on slurm

0

3 months ago

Arthur ▴ 10

I need to submit jobs on Slurm using parallel processing, so that the code within the shell script is completed within each file folder. If there are 30 folders, and one file to process within each folder separately, I need the shell script to work for all the folders simultaneously but separately from one another.

#!/bin/bash
#SBATCH --ntasks 1
#SBATCH --time 8:0:0
#SBATCH --mem 400G
#SBATCH --qos bbdefault
#SBATCH --mail-type=END

samtools bam2fq file.bam > fastq

Does anyone know how I can submit the above job in parallel on Slurm?

Many thanks,
Arthur

bam slurm shell parallel • 941 views

ADD COMMENT • link updated 3 months ago by rfran010 ★ 1.3k • written 3 months ago by Arthur ▴ 10

score 0 · Answer 1 · 2024-08-05

See https://nf-co.re/bamtofastq/2.1.1/

$ cat slurm.cfg

process {
    executor = "slurm"
    clusterOptions = "--qos bbdefault --mail-type=END"
    cache = "lenient"
    maxForks = 100
    errorStrategy = "finish"
}


$ find /dir1/dir2 -type f -name "*.bam" | samtools samples | awk -F '\t' 'BEGIN{printf("sample_id,mapped,index,file_type\n");} {printf("%s,%s,%s.bai,bam\n",$1,$2,$2);}' > samplesheet.csv

$ nextflow run nf-core/bamtofastq \
   -profile singularity \
   -c slurm.cfg \
   --input samplesheet.csv \
   --outdir output.dir
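
To see what the awk step above writes without touching any real data, it can be fed a made-up line. This is only a sketch: it assumes the input is "sample<TAB>path", which is the layout `samtools samples` prints, and the path and the wrapper name `make_samplesheet` are hypothetical.

```shell
#!/bin/bash
# Preview the samplesheet produced by the awk step, using a made-up input line.
# ASSUMPTION: input lines are "sample<TAB>path"; the path below is hypothetical.

make_samplesheet() {
    # Same awk program as in the answer: header line, then one CSV row per input line.
    awk -F '\t' 'BEGIN{printf("sample_id,mapped,index,file_type\n");}
                 {printf("%s,%s,%s.bai,bam\n",$1,$2,$2);}'
}

printf 'x1001\t/dir1/dir2/x1001/x1001.bam\n' | make_samplesheet
```

Note that the index column is simply the BAM path with `.bai` appended, so this form of the samplesheet assumes an index file with that name sits next to each BAM.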

score 0 · Answer 2 · 2024-08-05

0

3 months ago

rfran010 ★ 1.3k

I would make the script take the BAM file as an argument, then use a for loop to submit one job per file. This will submit all the jobs at once, but they won't run strictly in parallel...

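#!/bin/bash
#SBATCH --ntasks 1
#SBATCH --time 8:0:0
#SBATCH --mem 400G
#SBATCH --qos bbdefault
#SBATCH --mail-type=END

# Path to the input BAM, passed as the first argument to the script
INPUT_BAM="${1}"
# Strip the trailing .bam to build the output names
OUTPUT_FQ1="${INPUT_BAM%.bam}.R1.fq"
OUTPUT_FQ2="${INPUT_BAM%.bam}.R2.fq"

# Paired-end output: read 1 and read 2 go to separate files
samtools bam2fq -1 "${OUTPUT_FQ1}" -2 "${OUTPUT_FQ2}" -0 /dev/null -s /dev/null -n "${INPUT_BAM}"

# Alternatively (use this instead of the block above), write all reads to one file:
INPUT_BAM="${1}"
OUTPUT_FQ="${INPUT_BAM%.bam}.fq"

samtools bam2fq "${INPUT_BAM}" > "${OUTPUT_FQ}"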

Then submit as:

for file in folders/*file.bam; do sbatch script.sh "${file}"; done

Here folders/*file.bam needs to be a pattern that matches the file names you want to run. You can confirm with ls that the pattern matches the right files.
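
As a rough sketch of that check-then-submit step, the loop can be wrapped so that it first only prints the sbatch commands. The function name `submit_bams` and the `x*/*.bam` pattern are made up for illustration, and `script.sh` is assumed to be the batch script above.

```shell
#!/bin/bash
# Preview (and optionally submit) one job per BAM file matching a pattern.
# ASSUMPTIONS: script.sh is the batch script above; "x*/*.bam" is a hypothetical pattern.

submit_bams() {
    local dry_run="$1"   # "yes" = only print the commands, anything else = submit
    shift
    local bam
    for bam in "$@"; do
        # A pattern that matched nothing is left as literal text, so skip non-files.
        [ -f "$bam" ] || { echo "skipping (not a file): $bam" >&2; continue; }
        if [ "$dry_run" = "yes" ]; then
            echo "sbatch script.sh $bam"
        else
            sbatch script.sh "$bam"
        fi
    done
}

# Dry run first; change "yes" to "no" once the printed list looks right.
submit_bams yes x*/*.bam
```

Running it in dry-run mode plays the same role as the ls check, with the advantage that the exact commands are shown before anything reaches the queue.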

Edit: updated according to GenoMax's comment.

ADD COMMENT • link 3 months ago by rfran010 ★ 1.3k

0

The folders are named x1001, x1467, x1783 etc. so there is no obvious pattern by name, other than the x.

ADD REPLY • link 3 months ago by Arthur ▴ 10

0

What about filenames?

I imagine x*/*bam should work.

You can be more specific with the pattern by using regex or listing the folder names in a file, but it gets a little more complex.
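
A minimal sketch of the "folder names in a file" option, assuming a hypothetical `folders.txt` with one folder name per line and assuming each BAM is named after its folder (`<dir>/<dir>.bam`). Neither the file nor the helper name `bams_from_list` comes from the thread.

```shell
#!/bin/bash
# Build BAM paths from a list of folder names, one name per line.
# ASSUMPTIONS: folders.txt is a hypothetical file; each BAM is <dir>/<dir>.bam.

bams_from_list() {
    local list="$1"
    local dir
    # The "|| [ -n ... ]" part keeps a last line that has no trailing newline.
    while IFS= read -r dir || [ -n "$dir" ]; do
        [ -n "$dir" ] || continue          # ignore blank lines
        echo "${dir}/${dir}.bam"
    done < "$list"
}

# Print the sbatch commands that would be run (nothing is submitted here).
if [ -f folders.txt ]; then
    bams_from_list folders.txt | while IFS= read -r bam; do
        echo "sbatch script.sh $bam"
    done
fi
```

Replacing the `echo` in the last loop with the real `sbatch script.sh "$bam"` call would submit one job per listed folder.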

ADD REPLY • link 3 months ago by rfran010 ★ 1.3k

0

The file names have the same numbering as the folder.

ADD REPLY • link 3 months ago by Arthur ▴ 10

0

These numbers seem specific (non-sequential), so maybe you can also use whatever source was used to create them.

Otherwise, the pattern above should work, but might capture additional files you don't want... shouldn't be a serious error though.

ADD REPLY • link 3 months ago by rfran010 ★ 1.3k

0

"but they won't run strictly in parallel..."

Some of the submitted jobs should run in parallel, up to the resource allocation for OP's account; e.g. 4 may start if 40 are submitted.

While this is on OP, asking for 400G of RAM for a single job is overkill. It would make the jobs run one at a time, unless OP's cluster has very liberal allowances for RAM usage.
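
To make the memory point concrete, here is a back-of-the-envelope calculation. The 512 GB node size is invented for illustration, and the sketch treats memory as the only limit, which real schedulers do not.

```shell
#!/bin/bash
# How many jobs fit on one node if memory were the only limit?
# ASSUMPTION: 512 GB is a made-up node size; check your cluster's real limits.

jobs_per_node() {
    local node_gb="$1"
    local job_gb="$2"
    echo $((node_gb / job_gb))   # integer division: partial jobs do not count
}

jobs_per_node 512 400   # prints 1
jobs_per_node 512 20    # prints 25
```

On many clusters `sinfo -N -o "%N %m"` lists the memory per node (in MB), which gives a real number to plug in instead of the assumed 512.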

I am also not sure if > is the right way of doing this. Recent versions of samtools bam2fq have -1 and -2 options to write read 1 and read 2 out to their respective files. Otherwise they may all end up in a single file.

From the in-line help:

samtools bam2fq -1 pair1.fq -2 pair2.fq -0 /dev/null -s /dev/null -n in.bam

ADD REPLY • link 3 months ago by GenoMax 147k

0

I agree 400G seems like too much and thanks for checking on OP's command. Updated accordingly.

ADD REPLY • link 3 months ago by rfran010 ★ 1.3k

0

There is only one file to process per folder.

ADD REPLY • link 3 months ago by Arthur ▴ 10

0

Right, so the file is probably not so large that it needs 400G of RAM, right?

And if the file contains PE reads, do you want to output them into separate READ1 and READ2 files?

ADD REPLY • link 3 months ago by rfran010 ★ 1.3k

1

The memory needs to come down to about 20G per job.

No, there is only one BAM file per folder with combined reads, which don't need to be separated. Only one FASTQ file is to be generated per folder.

ADD REPLY • link 3 months ago by Arthur ▴ 10

0

I am using the PE reads combined within one file per folder. They will not be separated into R1, R2 at any stage. Hopefully that is OK to do for an NGS pipeline?

ADD REPLY • link 3 months ago by Arthur ▴ 10

0

"I am using the PE reads combined within one file per folder. They will not be separated into R1, R2 at any stage. Hopefully that is OK to do for an NGS pipeline?"

I don't know what you mean by "combined", but depending on what you are going to do, it is probably not OK to cat/merge R1/R2 reads end to end into a single file (unless the reads are interleaved, and even then only a select few programs can use data in that format).

ADD REPLY • link 3 months ago by GenoMax 147k

0

I have the R1 and R2 reads combined within the one BAM file. This is how it was received from the raw reads phase of the NGS pipeline. I have been able to use the BAM files throughout the NGS pipeline through to variant calling and annotation without any issues so far.

ADD REPLY • link 3 months ago by Arthur ▴ 10

0

It really depends on what you want to do. You will essentially create single-end reads. This will lose information on the fragment size and any anchoring of ambiguous reads, and if you later want to re-establish PE status, it will be more complex/difficult. BAM files keep track of all this information, so you lose it when you convert to FASTQ.

ADD REPLY • link 3 months ago by rfran010 ★ 1.3k

0

Previous files are retained throughout the NGS pipeline, so it is possible to make changes if needed. But the information I have so far suggests that a combined BAM file is fine. The aim is to identify variants and their annotations in the exome.

ADD REPLY • link 3 months ago by Arthur ▴ 10

0

The above script did not work unfortunately. I got the following error message on Slurm:-

Failed to read header for "-"

If I give the below as an example, how would this be incorporated into the script?

Folder X1001 contains X1001.Bam

Folder X1290 contains X1290.Bam

Folder X1383 contains X1383.Bam

ADD REPLY • link 3 months ago by Arthur ▴ 10

0

Give the command and script you used and maybe I can help.

https://github.com/bobbyiliev/introduction-to-bash-scripting/blob/main/ebook/en/content/004-bash-variables.md
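
One possible explanation for the "-" in that error, offered only as a guess since the exact command was not posted, is that the script was submitted without a file argument, so ${1} was empty and samtools fell back to reading standard input. The sketch below shows checks that could sit at the top of script.sh. The helper names `check_input` and `fastq_name` are hypothetical, and the handling of the capitalised `.Bam` extension is based on the example folders given above.

```shell
#!/bin/bash
# Checks that could go at the top of script.sh, before samtools is called.
# ASSUMPTION: the script is submitted as "sbatch script.sh <path-to-bam>".

check_input() {
    local bam="$1"
    if [ -z "$bam" ]; then
        echo "usage: sbatch script.sh <file.bam>" >&2
        return 1
    fi
    if [ ! -f "$bam" ]; then
        echo "no such file: $bam" >&2
        return 1
    fi
}

# Derive the FASTQ name, accepting either ".bam" or ".Bam" as the extension.
fastq_name() {
    local bam="$1"
    case "$bam" in
        *.bam) echo "${bam%.bam}.fq" ;;
        *.Bam) echo "${bam%.Bam}.fq" ;;
        *)     echo "${bam}.fq" ;;
    esac
}

fastq_name X1001/X1001.Bam   # prints X1001/X1001.fq
```

Inside script.sh these would be used as `check_input "$1" || exit 1` followed by `samtools bam2fq "$1" > "$(fastq_name "$1")"`, with the jobs submitted by something like `for f in X*/X*.Bam; do sbatch script.sh "$f"; done`. Shell patterns are case-sensitive, so `*.bam` would not match `X1001.Bam`.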

ADD REPLY • link 3 months ago by rfran010 ★ 1.3k