Question

Challenges with Parallelizing BUSCO for Phylogenetic Tree Construction on Large Genome Dataset

0

6 weeks ago

Mir-Mammad • 0

Hi,

I have a database of over 1,000 genomes, and I would like to run BUSCO to generate a phylogenetic tree. However, I am encountering memory issues during execution, as BUSCO often runs out of memory. I've attempted to parallelize the process by increasing the number of tasks, but parallelization doesn't seem to initiate. According to the logs, BUSCO operates on only one genome at a time, even when I increase the task count.

Below is the script I am currently using. I would appreciate any recommendations or advice on how to resolve this issue.

#!/bin/bash
#SBATCH --partition=node
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20
#SBATCH --mem=32G
#SBATCH --time=0
#SBATCH --job-name=busco
#SBATCH --output=busco-%j.txt

# Function to process .fna files in the ../input/ncbi directory and its subdirectories
process_fna_files() {
  local base_dir="../input/ncbi"
  local busco_db_path="../arthropoda_odb10" 

  # Find all .fna files in subdirectories of the base directory
  find "$base_dir" -type f -name "*.fna" | while read -r fna_file; do
    # Extract the species name from the subdirectory name (assuming the subdirectory is the species name)
    species_name=$(basename "$(dirname "$fna_file")")

    # Get the current timestamp (YYYY-MM-DD_HH-MM-SS)
    timestamp=$(date +"%Y-%m-%d_%H-%M-%S")

    # Create output directory for the species, appending the timestamp
    output_dir="./output/${species_name}_$timestamp"
    mkdir -p "$output_dir"

    # Run BUSCO using the local arthropoda_odb10 database in genome mode and offline mode
    busco -i "$fna_file" -l "$busco_db_path" -o "$output_dir" -m genome --cpu 20 --offline -f
  done
}

process_fna_files

Busco HPC Slurm • 483 views

ADD COMMENT • link updated 6 weeks ago by Ram 44k • written 6 weeks ago by Mir-Mammad • 0

2

A better option is to submit a separate SLURM job for each genome. A loop inside a single SLURM job is not efficient, because the genomes are processed one after another within one allocation. Instead, do something like this (pseudo-code, will not run as is):

for i in *.fna
do
     sbatch -p partition SLURM-options --wrap="busco -i '${i}' -l '${busco_db_path}' -o '${output_dir}' -m genome --cpu 20 --offline etc."
done

This will submit one job per genome. Depending on the allocation allowed for your account, a certain number of jobs will start and the rest will stay pending. Whenever a job finishes, a new one is pulled from the pending jobs until all are complete.
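A slightly fuller version of the loop above might look like the sketch below. It is only an illustration: the paths, partition name, and resource values are copied from the script in the question, the BUSCO flags are reused unchanged from that script, and `DRY_RUN` is a hypothetical switch (not a Slurm or BUSCO feature) that prints the `sbatch` commands instead of submitting them. It also assumes that file paths contain no single quotes.

```shell
#!/bin/bash
# Sketch only: one Slurm job per genome, following the suggestion above.
# Assumptions (not from the thread): DRY_RUN is a hypothetical switch, the
# paths and resources are copied from the question's script, and file paths
# contain no single quotes.

base_dir="../input/ncbi"
busco_db_path="../arthropoda_odb10"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the sbatch commands, 0 = submit them

# Species name = name of the directory that holds the .fna file.
species_name() {
  basename "$(dirname "$1")"
}

# Build the BUSCO command for one genome (flags as in the question's script).
busco_cmd() {
  printf "busco -i '%s' -l '%s' -o '%s' -m genome --cpu 20 --offline -f" \
    "$1" "$busco_db_path" "./output/$(species_name "$1")"
}

# Submit (or, in dry-run mode, print) one independent job per genome.
submit_all() {
  find "$base_dir" -type f -name "*.fna" | sort | while IFS= read -r fna_file; do
    name=$(species_name "$fna_file")
    cmd=$(busco_cmd "$fna_file")
    if [ "$DRY_RUN" = "1" ]; then
      printf 'sbatch --partition=node --cpus-per-task=20 --mem=32G --job-name=busco_%s --output=busco-%s-%%j.txt --wrap="%s"\n' \
        "$name" "$name" "$cmd"
    else
      sbatch --partition=node --cpus-per-task=20 --mem=32G \
        --job-name="busco_${name}" --output="busco-${name}-%j.txt" \
        --wrap="$cmd"
    fi
  done
}

# Only walk the tree when the input directory actually exists.
if [ -d "$base_dir" ]; then
  submit_all
fi
```

Running it with the default `DRY_RUN=1` lets you inspect the generated commands first; setting `DRY_RUN=0` submits them. Because each job handles a single genome, the memory request applies per genome rather than to the whole batch. On clusters that cap the number of queued jobs, a Slurm job array (`sbatch --array`) is a common alternative to submitting many individual jobs.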

ADD REPLY • link 6 weeks ago by GenoMax 147k

0

Thanks a lot!

ADD REPLY • link 6 weeks ago by Mir-Mammad • 0

1

This is not a direct answer to your original question, but rather a piece of advice.

I can't look through a tree that has more than 150-200 entries because it becomes too complex. It can't fit on a page or a computer screen. Do you really need 1000 entries? It is a safe assumption that many genomes in your list have average nucleotide identity that is >=95%. If you have some strains of the same species, they are probably >=99% identical. All those entries are guaranteed to be next to each other in the tree, so it really doesn't matter whether you build a tree to prove it or not. I suggest you narrow down your list to truly different genomes, and it probably won't be so challenging.
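One way to narrow the list down, sketched under stated assumptions: suppose pairwise identities have already been computed with a tool of your choice and saved in a tab-separated table (genome A, genome B, percent identity), and a second file lists the genomes one per line in order of preference. Both file formats and the function name `pick_representatives` are hypothetical and not from this thread. A simple greedy pass then keeps a genome only if no already-kept genome is at or above the threshold:

```shell
#!/bin/bash
# Sketch: pick one representative per group of near-identical genomes.
# Assumptions (not from the thread): pairwise identities already exist in a
# hypothetical tab-separated table (genome_a, genome_b, identity in percent),
# and a hypothetical list file gives one genome per line in priority order.

# Usage: pick_representatives LIST PAIRS [THRESHOLD]
# Greedy pass: a genome is kept unless an already-kept genome is
# >= THRESHOLD percent identical to it (default threshold: 95).
pick_representatives() {
  awk -F'\t' -v thr="${3:-95}" '
    FILENAME == ARGV[1] {             # first file: pairwise identity table
      if ($3 + 0 >= thr) { near[$1, $2] = 1; near[$2, $1] = 1 }
      next
    }
    {                                 # second file: genome list
      redundant = 0
      for (i = 1; i <= nkept; i++)
        if ((kept[i], $1) in near) { redundant = 1; break }
      if (!redundant) { kept[++nkept] = $1; print $1 }
    }
  ' "$2" "$1"
}
```

For example, `pick_representatives genomes.txt identities.tsv 95 > representatives.txt` would write the retained genome names to a file that can drive the BUSCO runs. The result depends on the order of the list, so it makes sense to put the best assemblies first.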

Once you build a smaller reference tree, additional sequences can be added to it using pplacer or a similar tool:

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-538

ADD REPLY • link 6 weeks ago by Mensur Dlakic ★ 28k

0

I will use the 1600+ species tree as a supplementary graph. The actual graph will have some branches collapsed for similar species, as I want to highlight clades with specific miRNA changes. The pplacer tool looks interesting, and I will definitely look into it!

ADD REPLY • link 6 weeks ago by Mir-Mammad • 0