Hi,
I am currently working on a pipeline that runs three bioinformatics tools (deepbgc, bagel, and antismash) on sample genome .fna contig files. The script with the commands for each of the programs looks like the following:
#Set up basename/sample ID
echo 'Sample ID = ' $1
b=$1;
input_dir="/EFS/Analyses/BGC.test/BGC.test2/";
#create directory for each strain:
mkdir /EFS/Analyses/BGC.test/BGC.test2/$b;
dir="/EFS/Analyses/BGC.test/BGC.test2/$b";
#create sub-folders for each strain:
#cd $dir ;
#for i in 00.raw_reads ; do
# if [[ ! -d $i ]] ; then mkdir $i ; fi ;
#done ;
#"1. antiSMASH"
echo "running antismash"
antismash --databases /EFS/database/antismash --genefinding-tool prodigal -c 30 --cb-general --cb-knownclusters --cb-subclusters --asf --pfam2go --smcog-trees --tigrfam --cc-mibig --output-dir $dir/antismash2 $input_dir/$b.fna
#"2. DeepBGC
echo "running DeepBGC"
# make conda usable in this non-interactive script, then activate the deepbgc2 env
# (assumes the miniconda install prefix is /EFS/tools/miniconda)
source /EFS/tools/miniconda/etc/profile.d/conda.sh
conda activate deepbgc2
# let HMMER/hmmscan use more CPUs
export HMMER_NCPU=16
deepbgc pipeline $input_dir/$b.fna -o $dir/Deepbgc
#"3. BAGEL 4"
echo "running BAGEL4"
source /EFS/EnvSetup/Metagenomes-Analysis.sh
/EFS/tools/BAGEL/bagel4_2022/bagel4_wrapper.pl -s $dir/Bagel4 -query $input_dir -r $b.fna
#Run bagel with multi threading
#export BAGEL_NUM_THREADS=16
#export MKL_NUM_THREADS=16
#-r regular expression for identifying files [default=.* meaning all files ]
#Read sample path in EFS directory and extract all relevant input files from every
#sample directory in sample path
python3 /EFS/zgonzalez/Secondary_Metabolites_Pipeline/merged_output.py
echo "Succesfully completed data merge"
Now the issue is that I need to run this pipeline on upwards of 1400 bacterial genomes, and the run times for these tools can be rather lengthy to say the least (especially DeepBGC, and more specifically hmmscan, which can take upwards of 20 minutes on a single sample). I have done some research into the tools and couldn't find much on running them on multiple contigs at once. I know there is a way to raise the CPU limit for hmmscan, but I have tried that and it doesn't speed up the run time very significantly. This led me to believe that processing multiple samples in parallel may be the way to go. If anyone is familiar with these tools and could give me some input on the best way to run them on a very large set of sample files, that would be much appreciated. Thanks!
If you convert this to a proper pipeline using e.g. Nextflow or Snakemake, it becomes easy to run the samples in parallel and asynchronously. You could also use GNU parallel to run your bash script on many samples at once without much effort; a rough sketch is below.
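Something along these lines should work, assuming your script above is saved as run_bgc_pipeline.sh (that filename and samples.txt are just placeholder names):
# build a list of sample basenames (the .fna files without their extension)
ls /EFS/Analyses/BGC.test/BGC.test2/*.fna | xargs -n1 basename | sed 's/\.fna$//' > samples.txt
# run up to 4 samples concurrently; each job gets one basename as $1
parallel --jobs 4 bash run_bgc_pipeline.sh {} :::: samples.txt
If you go this route, it is worth lowering the per-sample thread counts inside the script (e.g. antismash -c and HMMER_NCPU) so the concurrent jobs don't oversubscribe the machine.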
Hey, not exactly what you are looking for, but this blog post may help: https://divingintogeneticsandgenomics.com/post/real-life-bioinformatics-skill-deal-with-one-sample-to-a-lot-of-samples/