Mir-Mammad • 5 hours ago
Hi,
I have a database of over 1,000 genomes, and I would like to run BUSCO on them to generate a phylogenetic tree. However, I am running into memory issues: BUSCO often runs out of memory during execution. I've attempted to parallelize the process by increasing the number of tasks, but parallelization never seems to kick in. According to the logs, BUSCO processes only one genome at a time, even when I increase the task count.
Below is the script I am currently using. I would appreciate any recommendations or advice on how to resolve this issue.
#!/bin/bash
#SBATCH --partition=node
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20
#SBATCH --mem=32G
#SBATCH --time=0
#SBATCH --job-name=busco
#SBATCH --output=busco-%j.txt

# Process .fna files in the ../input/ncbi directory and its subdirectories
process_fna_files() {
    local base_dir="../input/ncbi"
    local busco_db_path="../arthropoda_odb10"

    # Find all .fna files in subdirectories of the base directory
    find "$base_dir" -type f -name "*.fna" | while read -r fna_file; do
        # Extract the species name from the subdirectory name
        # (assuming the subdirectory is named after the species)
        species_name=$(basename "$(dirname "$fna_file")")

        # Timestamp (YYYY-MM-DD_HH-MM-SS) keeps output directories unique
        timestamp=$(date +"%Y-%m-%d_%H-%M-%S")

        # Create the per-species output directory
        output_dir="./output/${species_name}_$timestamp"
        mkdir -p "$output_dir"

        # Run BUSCO with the local arthropoda_odb10 lineage in genome mode, offline
        busco -i "$fna_file" -l "$busco_db_path" -o "$output_dir" -m genome --cpu 20 --offline -f
    done
}

process_fna_files
A better option is to submit a separate SLURM job for each genome. Using a for loop inside a single SLURM job is not efficient, since the genomes are processed one at a time (which is exactly what your logs show). Instead, use a small driver script that loops over the genomes and calls sbatch once per genome. This will submit jobs for all genomes; depending on the allocation allowed for your account, a certain number of jobs will start and the rest will pend. Once a job finishes, a new one will be pulled from the pending jobs until all complete.
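A minimal sketch of such a driver script, run on the login node (the partition name, CPU/memory numbers, and paths are assumptions carried over from the question; adjust them for your cluster):

```shell
#!/bin/bash
# Sketch: submit one small SLURM job per genome instead of looping over
# genomes inside a single job. Resource requests below are assumptions.

submit_busco_jobs() {
    local base_dir="$1"    # e.g. ../input/ncbi
    local busco_db="$2"    # e.g. ../arthropoda_odb10

    find "$base_dir" -type f -name "*.fna" | while read -r fna_file; do
        # Subdirectory name is assumed to be the species name
        species=$(basename "$(dirname "$fna_file")")

        # One job per genome; the scheduler runs as many at once as your
        # allocation allows and keeps the rest pending.
        sbatch --partition=node --nodes=1 --ntasks=1 --cpus-per-task=8 \
               --mem=32G --job-name="busco_${species}" \
               --output="busco_${species}_%j.txt" \
               --wrap="busco -i '$fna_file' -l '$busco_db' -o 'output/${species}' -m genome --cpu 8 --offline -f"
    done
}

# On the cluster you would run, e.g.:
# submit_busco_jobs ../input/ncbi ../arthropoda_odb10
```

Note that with per-genome jobs each job needs only enough memory for one genome, which also addresses the out-of-memory problem.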