Challenges with Parallelizing BUSCO for Phylogenetic Tree Construction on a Large Genome Dataset
Mir-Mammad

Hi,

I have a database of over 1,000 genomes and would like to run BUSCO on them to build a phylogenetic tree. However, I keep running into memory issues: BUSCO often runs out of memory during execution. I have tried to parallelize the process by increasing the number of tasks, but the parallelization never seems to start; according to the logs, BUSCO processes only one genome at a time, even when I increase the task count.

Below is the script I am currently using. I would appreciate any recommendations or advice on how to resolve this issue.

#!/bin/bash
#SBATCH --partition=node
#SBATCH --nodes=1      
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20
#SBATCH --mem=32G          
#SBATCH --time=0
#SBATCH --job-name=busco
#SBATCH --output=busco-%j.txt

# Function to process .fna files in the ../input/ncbi directory and its subdirectories
process_fna_files() {
  local base_dir="../input/ncbi"
  local busco_db_path="../arthropoda_odb10" 

  # Find all .fna files in subdirectories of the base directory
  find "$base_dir" -type f -name "*.fna" | while read -r fna_file; do
    # Extract the species name from the subdirectory name (assuming the subdirectory is the species name)
    species_name=$(basename "$(dirname "$fna_file")")

    # Get the current timestamp (YYYY-MM-DD_HH-MM-SS)
    timestamp=$(date +"%Y-%m-%d_%H-%M-%S")

    # Create output directory for the species, appending the timestamp
    output_dir="./output/${species_name}_$timestamp"
    mkdir -p "$output_dir"

    # Run BUSCO using the local arthropoda_odb10 database in genome mode and offline mode
    busco -i "$fna_file" -l "$busco_db_path" -o "$output_dir" -m genome --cpu 20 --offline -f
  done
}

process_fna_files

A better option is to submit a separate SLURM job for each genome. A for loop inside a single SLURM job is not efficient, because the genomes are processed one after another within that job's allocation. Instead, do something like this (pseudo-code, will not run as is):

for i in ../input/ncbi/*/*.fna
do
     sbatch -p partition [SLURM-options] --wrap="busco -i ${i} -l ${busco_db_path} -o ${output_dir} -m genome --cpu 20 --offline"
done

This will submit one job per genome. Depending on the allocation allowed for your account, a certain number of jobs will start and the rest will pend. Once a job finishes, a new one will be pulled from the pending queue until all of them complete.
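If your cluster limits how many individual jobs you can queue, a SLURM job array gives the same per-genome parallelism from a single submission. Below is a minimal sketch based on the assumptions in the original post (the ../input/ncbi and ../arthropoda_odb10 paths, the node partition, and roughly 1,000 genomes); adjust the array range and the %50 throttle to match your file count and allocation.

#!/bin/bash
#SBATCH --partition=node
#SBATCH --cpus-per-task=20
#SBATCH --mem=32G
#SBATCH --job-name=busco
#SBATCH --output=busco-%A_%a.txt
#SBATCH --array=0-999%50      # one task per genome, at most 50 running at a time (assumes ~1,000 .fna files)

# Build a deterministic list of genomes; each array task picks one by its index
mapfile -t fna_files < <(find ../input/ncbi -type f -name "*.fna" | sort)
fna_file="${fna_files[$SLURM_ARRAY_TASK_ID]}"
species_name=$(basename "$(dirname "$fna_file")")

# One BUSCO run per array task, each with its own 20 CPUs and 32 GB
busco -i "$fna_file" -l ../arthropoda_odb10 -o "${species_name}_busco" \
      -m genome --cpu "$SLURM_CPUS_PER_TASK" --offline -f

Each array task writes its own busco-<jobid>_<index>.txt log, and the results land in per-species folders named <species>_busco in the submission directory.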

