I need to submit jobs on Slurm so that the same shell script runs in parallel, once inside each folder. There are 30 folders with one file to process in each, and I need the script to work on all the folders simultaneously but separately from one another.
#SBATCH --ntasks 1
#SBATCH --time 8:0:0
#SBATCH --mem 400G
#SBATCH --qos bbdefault
#SBATCH --mail-type=END
samtools bam2fq file.bam > fastq
Does anyone know how I can submit the above job in parallel on Slurm?
Many thanks,
Arthur
The folders are named x1001, x1467, x1783 etc. so there is no obvious pattern by name, other than the x.
What about filenames?
I imagine x*/*bam should work. You can be more specific with the pattern by using a regex or by listing the folder names in a file, but it gets a little more complex.
The file names have the same numbering as the folder.
These numbers seem specific (non-sequential), so maybe you can also use whatever source was used to create them.
Otherwise, the pattern above should work, but might capture additional files you don't want... shouldn't be a serious error though.
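To avoid picking up extra files, one option is to accept only a BAM file whose name matches its folder, since the file names share the folder's numbering. A minimal sketch, assuming lowercase folder names starting with x and a lowercase .bam extension (the function name list_matching_bams is made up for this example):

```shell
#!/bin/bash
# Print only BAM files whose base name equals their folder name,
# e.g. x1001/x1001.bam, and ignore anything else in the folder.
list_matching_bams() {
    local dir name
    for dir in x*/; do
        name=${dir%/}                      # strip the trailing slash
        if [ -f "$dir$name.bam" ]; then
            printf '%s\n' "$dir$name.bam"
        fi
    done
}

list_matching_bams
```

Running this from the parent directory before submitting anything gives a quick preview of exactly which files would be processed.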
Some of the submitted jobs should run in parallel, up to the resource allocation for OP's account, e.g. 4 may start at once if 40 are submitted.
While this is on OP, asking for 400G of RAM for a single job is overkill. It would make the jobs run one at a time, unless OP's cluster has very liberal allowances for RAM usage.
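One way to get independent jobs that the scheduler can run side by side is to submit one small job per folder. The following is only a sketch: the SBATCH and MEM variables and the submit_all function are invented here, the 20G default is the figure mentioned later in the thread, and the folder and file naming (x1001/x1001.bam) is assumed.

```shell
#!/bin/bash
# Submit one independent Slurm job per folder.
# SBATCH and MEM are overridable variables invented for this sketch:
#   SBATCH="echo sbatch" gives a dry run that only prints the commands.
SBATCH=${SBATCH:-sbatch}
MEM=${MEM:-20G}

submit_all() {
    local dir name
    for dir in x*/; do
        name=${dir%/}
        # Skip folders that do not contain a BAM named after the folder.
        [ -f "$dir$name.bam" ] || continue
        # --chdir runs the job inside the folder; --wrap supplies the command.
        $SBATCH --ntasks 1 --time 8:0:0 --mem "$MEM" --qos bbdefault \
            --chdir "$name" --job-name "bam2fq_$name" \
            --wrap "samtools bam2fq $name.bam > $name.fastq"
    done
}
```

After sourcing the script, `SBATCH="echo sbatch" submit_all` prints the commands without submitting them, and a plain `submit_all` submits them. How many run at once still depends on the cluster's limits for the account.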
I am also not sure if > is the right way of doing this. samtools bam2fq of recent vintage has -1, -2 options (per the in-line help) to write reads out to respective files. Otherwise they may all end up in a single file.
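To make the difference concrete, here is a rough sketch of the two ways of calling it. The function names, the _R1/_R2 output naming and the SAMTOOLS override variable are assumptions for illustration only. Note also that split paired output generally expects the two mates of a pair to be adjacent in the BAM (for example after name-sorting or samtools collate).

```shell
#!/bin/bash
# SAMTOOLS can be overridden (e.g. SAMTOOLS=echo) for a dry run;
# this variable is an assumption of the sketch, not a samtools feature.
SAMTOOLS=${SAMTOOLS:-samtools}

# All reads go to standard output, so a redirect puts them in ONE file.
bam_to_single_fastq() {
    local bam=$1 out=$2
    "$SAMTOOLS" bam2fq "$bam" > "$out"
}

# Read 1 and read 2 are written to separate files via -1 and -2.
bam_to_paired_fastq() {
    local bam=$1 prefix=$2
    "$SAMTOOLS" bam2fq -1 "${prefix}_R1.fastq" -2 "${prefix}_R2.fastq" "$bam"
}
```

For example, `bam_to_paired_fastq x1001.bam x1001` would be expected to produce x1001_R1.fastq and x1001_R2.fastq, while `bam_to_single_fastq x1001.bam x1001.fastq` keeps everything in one file.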
I agree 400G seems like too much and thanks for checking on OP's command. Updated accordingly.
There is only one file to process per folder.
Right, so the file is probably not so large that it needs 400G of RAM, right?
And if the file contains PE reads, do you want to output them into separate READ1 and READ2 files?
The memory needs to come down to about 20G per job.
No, there is only one BAM file per folder with combined reads, which don't need to be separated. Only one FASTQ file is to be generated per folder.
I am using the PE reads combined within one file per folder. They will not be separated into R1, R2 at any stage. Hopefully that is OK to do for an NGS pipeline?
Don't know what you mean by "combined", but it is probably not OK (depends on what you are going to do) to cat/merge R1/R2 reads into a single file end to end (unless the reads are interleaved, and even then only a select few programs can use data in that format).
I have the R1 and R2 reads combined within the one BAM file. This is how it was received from the raw reads phase of the NGS pipeline. I have been able to use the BAM files throughout the NGS pipeline through to variant calling and annotation without any issues so far.
It really depends on what you want to do. You will essentially create single-end reads. This will lose information on the fragment size and any anchoring of ambiguous reads, and if you later want to re-establish PE status, it will be more complex/difficult. BAM files keep track of all this information, so you lose it when you make the conversion to FASTQ.
Previous files are retained throughout the NGS pipeline, so it is possible to make changes if needed. But the information I have so far suggests that a combined BAM file is fine. The aim is to identify variants and their annotations in the exome.
The above script did not work unfortunately. I got the following error message on Slurm:-
Failed to read header for "-"
If I give the below as an example, how would this be incorporated into the script?:-
Folder X1001 contains X1001.Bam
Folder X1290 contains X1290.Bam
Folder X1383 contains X1383.Bam
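For folders like these, a Slurm job array is one possible approach. The sketch below is not a tested solution for this cluster. It assumes a hypothetical file folders.txt listing one folder name per line, a lowercase .bam extension (file names are case-sensitive, so adjust to what is actually on disk), and 20G of memory per task. The check before samtools is there because a "Failed to read header" message for "-" usually suggests samtools was given no input file and fell back to standard input, for example when a variable holding the file name was empty.

```shell
#!/bin/bash
#SBATCH --ntasks 1
#SBATCH --time 8:0:0
#SBATCH --mem 20G
#SBATCH --qos bbdefault
#SBATCH --mail-type=END
#SBATCH --array=1-3

# Print the Nth line (1-based) of the folder list.
# folders.txt is a hypothetical file with one folder name per line.
nth_folder() {
    sed -n "${1}p" "${2:-folders.txt}"
}

# Only do the work when running as an array task under Slurm.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
    folder=$(nth_folder "$SLURM_ARRAY_TASK_ID")
    bam="$folder/$folder.bam"
    # Stop early instead of letting samtools run without an input file.
    if [ -z "$folder" ] || [ ! -f "$bam" ]; then
        echo "No BAM file found for task $SLURM_ARRAY_TASK_ID: '$bam'" >&2
        exit 1
    fi
    samtools bam2fq "$bam" > "$folder/$folder.fastq"
fi
```

With folders.txt containing X1001, X1290 and X1383 on separate lines, submitting this once with sbatch from the parent directory would start three tasks, one per folder. For 30 folders the array range would become 1-30.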
Give the command and script you used and maybe I can help.
https://github.com/bobbyiliev/introduction-to-bash-scripting/blob/main/ebook/en/content/004-bash-variables.md