Question

bash script

0

Entering edit mode

6 months ago

bestone ▴ 30

Hello everyone, there are 32 different apricot genotypes, and I want to analyze their whole genome sequence. I tried to write a bash script code for this, but I could not succeed. I named all the raw data as Genotype1.R1.fg.gz, Genotype1.R2.fg.gz, Genotype2.R1.fg.gz, Genotype2.R2.fg.gz.... 32 of them. Instead of analyzing them one by one. Can you help me write one code and do all the analysis? With the analysis I will do: BWA, SAMTOOLS, PICARD, GATK

whole-genome-sequencing • 1.2k views

ADD COMMENT • link updated 6 months ago by Ram 44k • written 6 months ago by bestone ▴ 30

1

Entering edit mode

#!/bin/bash
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=..........@gmail.com
#SBATCH --ntasks-per-node=16

set -eu # stop on any error or unbound variable

BWA=~/tmm/bwa.kit/bwa
SAMTOOLS=~/tmm/bwa.kit/samtools
PICARD=/truba/home/ue/Bioinformatic_workflow/programs/picard.jar
GATK=~/gatk4-ulak/gatk-4.2.6.1/gatk
REFSEQ="/truba/home/ue/Bioinformatic_workflow/ref_seq/prunus_armeniaca_gca.903112645.fasta"

SET_DIR="/truba/home/ue/set"
mkdir -p analysis_results
for r1_file in "$SET_DIR"/Genotip*.R1.fq.gz; do
    r2_file="$r1_file%.R1.fq.gz.R2.fq.gz" # This isn't correct variable replacement syntax, I think you want
    r2_file=${r1_file/.R1.fq.gz/.R2.fq.gz} 
    sample_name=$ "$r1_file" .R1.fq.gz 
    genom_name="$sample_name_genom.fasta"  # I think you need:
    genom_name=${sample_name}_genom.fasta 

   ### You should try to connect these commands with pipes to avoid writing out redundant files

   bwa mem -t 4 prunus_armeniaca_gca.903112645.fasta "$r1_file" "$r2_file" > "${genom_name}_aligned.sam"
    ### should use the REFSEQ variable:
   bwa mem -t 4 $REFSEQ "$r1_file" "$r2_file" > "${genom_name}_aligned.sam"

    samtools view -bS "${genom_name}_aligned.sam" > "${genom_name}_aligned.bam"
    samtools sort "${genom_name}_aligned.bam" -o "${genom_name}_sorted.bam"
    samtools index "${genom_name}_sorted.bam"
    ### and here too:
    gatk HaplotypeCaller -R $REFSEQ -I "${genom_name}_sorted.bam" -O "${genom_name}_raw_snps.vcf"
   gatk VariantFiltration -V ${genom_name}_raw_snps.vcf -O ${genom_name}_filtered_snps.vcf --filter-name "QUAL_FILTER" --filter-expression "QD < 2.0 || FS > 60.0 || MQ < 40.0" 

  mv "${genom_name}_sorted.bam" analysis_results/
  mv "${genom_name}_sorted.bam.bai" analysis_results/
  mv "${genom_name}_raw_snps.vcf" analysis_results
done

Note: I tried this code but it didn't work

ADD REPLY • link updated 6 months ago by Michael 55k • written 6 months ago by bestone ▴ 30

1

Entering edit mode

This is a SLURM jobscript. The obvious error I can see now is that the variable $sample_name_genom is not defined and that you are not using the $REFSEQ variable properly to pass the reference to all programs. Further: --filter-name$doesn't look good, and the name should precede the expression (I think).

To spot these mistakes automatically, I recommend to always add the set -eu directive as the line after the #SBATCH parameter block. There are other things that need to be optimized, but for further help, we need the exact error messages that occur, not just 'didn't work'.

ADD REPLY • link 6 months ago by Michael 55k

1

Entering edit mode

you'd better learn how to use a workflow manager like nextflow or snakemake instead of using bash loops + slurm.

ADD REPLY • link 6 months ago by Pierre Lindenbaum 164k

2

Entering edit mode

Or use array jobs, or submit jobs in a loop. I don't see any problems in using loops. I use snakemake on slurm and still like bash loops and use them to submit actual snakemake jobs that run on slurm nodes and trigger child jobs on other slurm nodes. It doesn't have to be one or the other.

IMO OP is starting just now and will evolve to a place where they will need workflow management tools. They should be made aware that tools exist and they can use them along with other tools they're familiar with instead of asking them to replace what they know with something entirely new.

ADD REPLY • link 6 months ago by Ram 44k

0

Entering edit mode

I run it but it says:

sbatch: error: Batch job submission failed: Unspecified error

ADD REPLY • link updated 6 months ago by Ram 44k • written 6 months ago by bestone ▴ 30

1

Entering edit mode

That looks like a problem with your cluster setup and you could not run any job. Try to make a simple job script first and debug that until it is running properly. I noticed there may be a few parameters required for a SLURM job that could be missing from your script (partition to use, account name, time). Find some local support to figure this out. We cannot do this from here because every cluster configuration is slightly different. Also, some variables can be set via the environment, I believe the error is coming from some setting you haven't shown.

What happens for example if you do: srun -N 1 -t 0:01:00 bash -c "uname -a"

ADD REPLY • link 6 months ago by Michael 55k

0

Entering edit mode

Thank you for replying. I changed it but still couldn't figure it out. I added the last code that I run it

#!/bin/bash
#SBATCH --time=24:00:00
#SBATCH --partition=barbun
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=....@gmail.com
#SBATCH --ntasks-per-node=16

BWA=~/tmm/bwa.kit/bwa
SAMTOOLS=~/tmm/bwa.kit/samtools
PICARD=/truba/home/ue/Bioinformatic_workflow/programs/picard.jar
GATK=~/gatk4-ulak/gatk-4.2.6.1/gatk
REFSEQ="/truba/home/ue/Bioinformatic_workflow/ref_seq/prunus_armeniaca_gca.903112645.fasta"

mkdir -p BAMS

forwardreads=(/truba/home/ue/set/*_R1*)

for forwardread in ${forwardreads[@]}
do

reversead=$(echo $forwardread | sed 's\R1\R2\g')
outputfilename=$(basename $forwardread _R1.fq.gz)


$BWA mem -t 4 $REFSEQ $forwardread $reversead >"BAMS/${outputfilename}_aligned.sam"
$SAMTOOLS view -bS "BAMS/${outputfilename}_aligned.sam" > "BAMS/${outputfilename}_aligned.bam"
$SAMTOOLS sort "BAMS/${outputfilename}_aligned.bam" -o "BAMS/${outputfilename}_aligned_sorted.bam"
$SAMTOOLS index "BAMS/${outputfilename}_aligned_sorted.bam"

done

ADD REPLY • link updated 6 months ago by GenoMax 146k • written 6 months ago by bestone ▴ 30

2

Entering edit mode

That doesn't answer my question. I first wanted to make sure that you can run any job at all. Also, please remember to always post the error message. I am unable to read your mind (un-)fortunately.

ADD REPLY • link 6 months ago by Michael 55k

0

Entering edit mode

You write this "I noticed there may be a few parameters required for a SLURM job that could be missing from your script (partition to use, account name, time)." but when you look at my bash you will see them they are not missing and also If you read the messages I sent more carefully, you would see that I wrote the error code at the top. Thank you for replying.

ADD REPLY • link 6 months ago by bestone ▴ 30

0

Entering edit mode

What "error code at the top" are you talking about? Error code in response to the srun -N 1 -t 0:01:00 bash -c "uname -a" command? Also, Michael clearly asks you to find local support for that (possible) problem because no one outside your cluster admin/users can help you with things specific to your cluster. You seem to be addressing that problem and not the one that he said he can help you with.

ADD REPLY • link 6 months ago by Ram 44k