Question

slurm batch

11 months ago

bestone ▴ 30

Hello guys,

I want to run multiple jobs with the Slurm batch command, but I couldn't figure out how. I have a script, added below, but it doesn't work. Some of my data come from Illumina, so those samples have two files (fastq1 and fastq2). Other samples come from PacBio and have four different FASTQ files instead of a fastq1/fastq2 pair. How can I run all of them with a single command? Could you please help me with this issue?

#!/bin/bash
#SBATCH --time=24:00:00
#SBATCH --partition=barbun
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=u....@gmail.com
#SBATCH --ntasks-per-node=16

# Tool locations
BWA=~/tmm/bwa.kit/bwa
SAMTOOLS=~/tmm/bwa.kit/samtools
PICARD=/truba/home/ue/Bioinformatic_workflow/programs/picard.jar
GATK=~/gatk4-ulak/gatk-4.2.6.1/gatk

# Check the arguments before using them
if [ $# -lt 5 ]; then
  echo "Usage: $0 REFSEQ FASTQ_1 FASTQ_2 SAMPLE_NAME OUTPUT_DIR" >&2
  exit 1
fi

# Inputs are taken from the command line; values previously hardcoded here were:
#   REFSEQ=/truba/home/ue/Bioinformatic_workflow/ref_seq/prunus_armeniaca_gca.903112645.fasta
#   FASTQ_1=/truba/home/ue/whole_genome/B11/B11_1.fq.gz
#   FASTQ_2=/truba/home/ue/whole_genome/B11/B11_2.fq.gz
#   OUTPUT_DIR=/truba/home/ue/Bioinformatic_workflow/b11_workflow/output_files_for
REFSEQ=$1
FASTQ_1=$2
FASTQ_2=$3
SAMPLE_NAME=$4
OUTPUT_DIR=$5

# Align the paired-end reads and write a SAM file named after the sample
"$BWA" mem -t "$SLURM_NTASKS_PER_NODE" -R "@RG\tID:$SAMPLE_NAME\tSM:$SAMPLE_NAME\tPL:ILLUMINA" "$REFSEQ" "$FASTQ_1" "$FASTQ_2" > "$OUTPUT_DIR/${SAMPLE_NAME}_output.sam"

slurm samtools bwa gatk • 707 views

updated 11 months ago by Michael 55k • written 11 months ago by bestone ▴ 30

The script uses BWA mem to align your paired-end files, which is fine. For PacBio reads, however, it is better to use another aligner such as Minimap2. You can either align each FASTQ separately and then merge the BAMs, or merge the FASTQ files per sample and run the aligner once.
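
A minimal sketch of the first option (align each FASTQ separately, then merge the BAMs) could look like the following. It assumes `minimap2` and `samtools` are on your `PATH` and that the reads are PacBio CLR, hence the `map-pb` preset (recent minimap2 versions also offer `map-hifi` for HiFi reads). The function name `align_and_merge`, the `DRY_RUN` switch and the output file names are made up for illustration, and the per-file name is taken as everything before the first dot.

```shell
#!/bin/bash
# Sketch only: align each PacBio FASTQ of one sample with minimap2,
# sort each alignment, then merge the sorted BAMs into one file.
# Function name, DRY_RUN switch and output names are hypothetical.

# DRY_RUN=1 (default) prints the commands instead of running them.
DRY_RUN=${DRY_RUN:-1}
THREADS=${SLURM_NTASKS_PER_NODE:-4}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Usage: align_and_merge REFSEQ SAMPLE_NAME OUTPUT_DIR FASTQ [FASTQ...]
align_and_merge() {
  if [ "$#" -lt 4 ]; then
    echo "Usage: align_and_merge REFSEQ SAMPLE_NAME OUTPUT_DIR FASTQ [FASTQ...]" >&2
    return 1
  fi
  local ref=$1 sample=$2 outdir=$3
  shift 3
  local bams=() fq base
  for fq in "$@"; do
    base=$(basename "$fq")
    base=${base%%.*}   # strip everything after the first dot
    # One read group per input file, all sharing the same sample (SM) tag
    run minimap2 -ax map-pb -t "$THREADS" \
      -R "@RG\tID:$base\tSM:$sample\tPL:PACBIO" \
      -o "$outdir/$base.sam" "$ref" "$fq"
    run samtools sort -@ "$THREADS" -o "$outdir/$base.sorted.bam" "$outdir/$base.sam"
    bams+=("$outdir/$base.sorted.bam")
  done
  run samtools merge "$outdir/$sample.merged.bam" "${bams[@]}"
}

# Run only when the script itself is called with arguments
if [ "$#" -ge 4 ]; then
  align_and_merge "$@"
fi
```

Running it with the default `DRY_RUN=1` only prints the commands, so you can check them before setting `DRY_RUN=0`. Writing an intermediate SAM file keeps the sketch simple; piping minimap2 straight into `samtools sort` would save disk space.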

11 months ago by JC 13k

Thank you so much for your reply, JC. But what I want is to run all of these analyses with a single command. For example, once all the FASTQ files have been aligned with BWA, they should be processed with samtools and then with GATK, all from one command. After the BWA step, I do not want to write separate commands for each file.

11 months ago by bestone ▴ 30

As JC explained, this isn't really recommended, and you will have to treat paired-end and non-paired-end files differently. If you try to mix these processes in one job (with some sort of automatic detection based on file name, which is definitely feasible), you are more likely to create a mess. It also goes against the proper use of a cluster. Instead, I recommend keeping these "low complexity" scripts that each do a single task properly.

1. Use the script above for PE Illumina data only.
2. Create another script with a proper command line using minimap2 for long-read data.
3. Put long-read and PE Illumina data into different folders (e.g. PacBio_seq, Illumina_seq); they need different pre-processing and QC anyway.
4. Submit a batch job for each pair of Illumina files and for each PacBio file, using sbatch in a for loop in each folder separately. You still need to specify some parameters (sample name, reference sequence) yourself, so consider whether those can be hardcoded or derived from the file names.
5. If you need more automation, consider using snakemake or makefiles.
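
The submission step in the list above can be sketched as a small wrapper around `sbatch`. This is only an illustration under several assumptions: the script names `align_illumina.sh` and `align_pacbio.sh` are hypothetical, Illumina pairs are assumed to follow the `*_1.fq.gz` / `*_2.fq.gz` naming from the question, PacBio files are assumed to end in `.fq.gz`, and the sample name is derived from the file name.

```shell
#!/bin/bash
# Sketch only: submit one Slurm job per Illumina pair and per PacBio file.
# align_illumina.sh and align_pacbio.sh are hypothetical script names.

# DRY_RUN=1 (default) prints the sbatch commands instead of submitting them.
DRY_RUN=${DRY_RUN:-1}

submit() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "sbatch $*"
  else
    sbatch "$@"
  fi
}

# One job per *_1.fq.gz / *_2.fq.gz pair; the sample name comes from the file name.
submit_illumina() {
  local dir=$1 ref=$2 outdir=$3 fq1 fq2 sample
  for fq1 in "$dir"/*_1.fq.gz; do
    [ -e "$fq1" ] || continue          # the glob matched nothing
    fq2=${fq1%_1.fq.gz}_2.fq.gz
    sample=$(basename "$fq1" _1.fq.gz)
    if [ ! -e "$fq2" ]; then
      echo "Skipping $sample: missing $fq2" >&2
      continue
    fi
    submit --job-name="bwa_$sample" align_illumina.sh \
      "$ref" "$fq1" "$fq2" "$sample" "$outdir"
  done
}

# One job per PacBio FASTQ; these files are not paired.
submit_pacbio() {
  local dir=$1 ref=$2 outdir=$3 fq sample
  for fq in "$dir"/*.fq.gz; do
    [ -e "$fq" ] || continue
    sample=$(basename "$fq" .fq.gz)
    submit --job-name="mm2_$sample" align_pacbio.sh \
      "$ref" "$fq" "$sample" "$outdir"
  done
}

# Usage: ./submit_all.sh REFSEQ OUTPUT_DIR   (run from the folder holding both data folders)
if [ "$#" -eq 2 ]; then
  submit_illumina Illumina_seq "$1" "$2"
  submit_pacbio PacBio_seq "$1" "$2"
fi
```

With the default `DRY_RUN=1` the wrapper only prints the `sbatch` lines, which makes it easy to check the derived sample names before submitting anything with `DRY_RUN=0`.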

11 months ago by Michael 55k