Question

Blastn produce a single ouput file instead of multiple output files

0

Entering edit mode

12 months ago

neha.jnu463 • 0

Hello,

I attempted to conduct a BLASTn analysis on multiple FASTA files using a for loop, but instead of generating individual output files for each queried FASTA file, it produced a single output file. Can you please help me with what I must change in my code to get the desired outcome?

Here's my code

#!/bin/bash
#SBATCH -J rep.euk.genomes
#SBATCH -N 1
#SBATCH --ntasks-per-node=64
#SBATCH -o %x.%j.out
#SBATCH -e %x.%j.err
#SBATCH -p nocona
#SBATCH --export=ALL

export BLASTDB=$BLASTDB:/lustre/research/phillips/rep_euk_genomes/db/

for f in *.fasta
do
outfile=blastn.${f}
/lustre/work/sneha/software/ncbi-blast-2.11.0+/bin/blastn -query ${f} -db /lustre/research/phillips/rep_euk_genomes/db/ref_euk_rep_genomes -max_target_seqs 5 -max_hsps 1 -out ${outfile} -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore sscinames scomnames"
done

blastn • 1.5k views

ADD COMMENT • link updated 12 months ago by Ram 44k • written 12 months ago by neha.jnu463 • 0

Ram · Accepted Answer · 2023-12-25

3

Entering edit mode

12 months ago

Michael 55k

While your code looks ok at first glance you should run your code again with

set -eux

following the SBATCH job control parameters, and then look at the output of the job again.

Also, the following line is going to cause you problems once you run the job more than once:

    outfile=blastn.${f}

could be replaced with:

    outfile=${f}.blastn

As it is now, you are creating blastn output files with filename pattern blastn.*.fasta which will be given as input to blast in the next iteration and result in an error that likely goes unnoticed.

Also, when running under SLURM you should set realistic maximum runtime with SBATCH, e.g.:

   #SBATCH --time=1-12:00:00

Also use multiprocessing options in blastn (-num_threads) with the cluster parameters. You might want to set --cpus-per-task=64 with num_threads over ntasks-per-node.

We don't know the default setting on your cluster so your job may well have been interrupted.

A last optimization step I recommend depending on the layout of your cluster is to submit multiple slightly smaller blastn jobs (around --cpus-per-task=16, num_threads 15 'ish, if that reduces resource allocation time) for a single file each using sbatch in a for loop. You seem to be on a larger cluster using the lustre distributed file system, so IO shouldn't be an issue.

ADD COMMENT • link 12 months ago by Michael 55k

0

Entering edit mode

I used the codes you suggested but still getting the same output file (only the first fasta file in the query). I also tried the script below:

#!/bin/bash
#set -eux
#SBATCH -J rep.euk.genomes
#SBATCH -N 1
#SBATCH --ntasks-per-node=64
#SBATCH -o %x.%j.out
#SBATCH -e %x.%j.err
#SBATCH -p nocona
#SBATCH --export=ALL

DB=/lustre/research/phillips/rep_euk_genomes/db/
B=/lustre/work/sneha/software/ncbi-blast-2.11.0+

infile="$1"
resultfile=blastn.${infile}
$B/bin/blastn -query fasta_not_bear_concat/${infile} -db $DB/ref_euk_rep_genomes -max_target_seqs 5 -max_hsps 1 -outfmt -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore sscinames scomnames" > blastn_out_rep_genomes/${resultfile}

I get this error while running this script:

Too many positional arguments (1), the offending value: 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore sscinames scomnames

ADD REPLY • link updated 12 months ago by Ram 44k • written 12 months ago by neha.jnu463 • 0

0

Entering edit mode

Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. If your code has long lines with a single command, break those lines into multiple lines with proper escape sequences so they're easier to read and still run when copy-pasted. I've done it for you this time.
code_formatting

ADD REPLY • link 12 months ago by Ram 44k

0

Entering edit mode

I used the codes you suggested

No, you did not. Neither of the changes suggested by Michael are in your code. ~~Also, why on earth does your SLURM batch file use command line parameters? Are you sure you're using them properly?~~ What is $B in $B/bin/blastn?

ADD REPLY • link 12 months ago by Ram 44k

0

Entering edit mode

I assume the command line argument is for running sbatch in a loop to submit multiple jobs like so: for f in *.fasta ; do sbatch run_it.sh $f; done. Depending on the scheduling policy of the cluster this may be much faster than the original code.

ADD REPLY • link 12 months ago by Michael 55k

0

Entering edit mode

~~That's not an option with slurm. One will need to export a BASH env variable or use a HEREDOC within a script as mentioned here: https://stackoverflow.com/a/44168719/1394178~~

EDIT: I was wrong. SLURM scripts work fine with command line arguments.

ADD REPLY • link 12 months ago by Ram 44k

0

Entering edit mode

That's not an option with slurm.

I have done this several times on Saga and LUMI, they both use Slurm and it worked for me. In my understanding parameters can be passed to job scripts just fine.

ADD REPLY • link 12 months ago by Michael 55k

0

Entering edit mode

Yes, you are correct. I dug deeper after I wrote my comment and discovered I was wrong.

ADD REPLY • link 12 months ago by Ram 44k

0

Entering edit mode

#set -eux

should be

set -eux

This is a shell command and not a slurm parameter. Using it with # just comments it out. Try the following code on one fasta file first and see how that works by looking at the .out and err files generated.

#!/bin/bash

#SBATCH -J rep.euk.genomes
#SBATCH -N 1
#SBATCH --cpus-per-task=16
#SBATCH -o %x.%j.out
#SBATCH -e %x.%j.err
#SBATCH -p nocona
#SBATCH --export=ALL
#SBATCH --time=2-00:00:00

set -eux

DB=/lustre/research/phillips/rep_euk_genomes/db/
B=/lustre/work/sneha/software/ncbi-blast-2.11.0+

infile="$1"
resultfile=${infile}.blastn
$B/bin/blastn -query fasta_not_bear_concat/${infile} -db $DB/ref_euk_rep_genomes \
    -max_target_seqs 5 -max_hsps 1 \
    -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore sscinames scomnames" \
    -num_threads 15 > blastn_out_rep_genomes/${resultfile}

ADD REPLY • link 12 months ago by Michael 55k

0

Entering edit mode

Not sure why a > is being used here instead of using -out blastn_out_rep_genomes/${resultfile}.

ADD REPLY • link 12 months ago by GenoMax 148k

0

Entering edit mode

When I run this code I get this error:

/var/spool/slurmd/job12482792/slurm_script: line 16: $1: unbound variable

ADD REPLY • link updated 12 months ago by Ram 44k • written 12 months ago by neha.jnu463 • 0

0

Entering edit mode

When I run

How are you running it? The error message is pretty clear BTW

ADD REPLY • link 12 months ago by Ram 44k

0

Entering edit mode

You have a superfluous -outfmt that causes the error. This looks like an attempt to run a single file per sbatch job in a for loop. You still should change the resultfile=blastn.${infile} to resultfile=${infile}.blastn to avoid problems in the future.

ADD REPLY • link updated 12 months ago by Ram 44k • written 12 months ago by Michael 55k