Hi,
I am currently working on a pipeline that runs three bioinformatics tools (deepbgc, bagel, and antismash) on sample genome .fna contig files. The script with the commands for each of the programs looks like the following:
#Set up basename/sample ID
echo 'Sample ID = ' $1
b=$1;
input_dir="/EFS/Analyses/BGC.test/BGC.test2/";
#create directory for each strain:
mkdir /EFS/Analyses/BGC.test/BGC.test2/$b;
dir="/EFS/Analyses/BGC.test/BGC.test2/$b";
#create sub-folders for each strain:
#cd $dir ;
#for i in 00.raw_reads ; do
# if [[ ! -d $i ]] ; then mkdir $i ; fi ;
#done ;
#"1. antiSMASH"
echo "running antismash"
antismash --databases /EFS/database/antismash --genefinding-tool prodigal -c 30 --cb-general --cb-knownclusters --cb-subclusters --asf --pfam2go --smcog-trees --tigrfam --cc-mibig --output-dir $dir/antismash2 $input_dir/$b.fna
#"2. DeepBGC
echo "running DeepBGC"
# make conda usable in this non-interactive script, then activate the deepbgc2 env
# (assumes the miniconda install prefix is /EFS/tools/miniconda)
source /EFS/tools/miniconda/etc/profile.d/conda.sh
conda activate deepbgc2
# let HMMER/hmmscan use more CPUs
export HMMER_NCPU=16
deepbgc pipeline $input_dir/$b.fna -o $dir/Deepbgc
#"3. BAGEL 4"
echo "running BAGEL4"
source /EFS/EnvSetup/Metagenomes-Analysis.sh
/EFS/tools/BAGEL/bagel4_2022/bagel4_wrapper.pl -s $dir/Bagel4 -query $input_dir -r $b.fna
#Run bagel with multi threading
#export BAGEL_NUM_THREADS=16
#export MKL_NUM_THREADS=16
#-r regular expression for identifying files [default=.* meaning all files ]
#Read sample path in EFS directory and extract all relevant input files from every
#sample directory in sample path
python3 /EFS/zgonzalez/Secondary_Metabolites_Pipeline/merged_output.py
echo "Succesfully completed data merge"
Now the issue is that I need to run this pipeline on upwards of 1400 bacterial genomes, and the run times for these tools can be rather lengthy to say the least (especially DeepBGC, and more specifically hmmscan, which can take upwards of 20 minutes on a single sample). I have done some research into the tools and couldn't find much on running them on multiple contigs at once. I know there is a way to raise the CPU limit for hmmscan, but I have tried that and it doesn't speed up the run time very significantly. This led me to believe that processing multiple samples in parallel may be the way to go. If anyone is familiar with these tools and could give me some input on the best way to run them on a very large set of sample files, that would be much appreciated. Thanks!
If you convert this to a proper pipeline using e.g. Nextflow or Snakemake, it becomes easy to run the samples in parallel and asynchronously. You could also use GNU parallel to run your bash script on many samples at once without much effort; a rough sketch is below.
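Something along these lines should work, assuming your script above is saved as run_bgc_pipeline.sh (that filename and samples.txt are just placeholder names):
# build a list of sample basenames (the .fna files without their extension)
ls /EFS/Analyses/BGC.test/BGC.test2/*.fna | xargs -n1 basename | sed 's/\.fna$//' > samples.txt
# run up to 4 samples concurrently; each job gets one basename as $1
parallel --jobs 4 bash run_bgc_pipeline.sh {} :::: samples.txt
If you go this route, it is worth lowering the per-sample thread counts inside the script (e.g. antismash -c and HMMER_NCPU) so the concurrent jobs don't oversubscribe the machine.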
Hey, not exactly what you are looking for, but this blog post may help: https://divingintogeneticsandgenomics.com/post/real-life-bioinformatics-skill-deal-with-one-sample-to-a-lot-of-samples/