Mir-Mammad • 11 weeks ago
Hi,
I have a database of over 1,000 genomes, and I would like to run BUSCO on all of them to build a phylogenetic tree. However, BUSCO frequently runs out of memory during execution. I have tried to parallelize the process by increasing the number of tasks, but the parallelization never starts: according to the logs, BUSCO processes only one genome at a time, no matter how high I set the task count.
Below is the script I am currently using. I would appreciate any recommendations or advice on how to resolve this issue.
#!/bin/bash
#SBATCH --partition=node
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20
#SBATCH --mem=32G
#SBATCH --time=0
#SBATCH --job-name=busco
#SBATCH --output=busco-%j.txt
# Process .fna files in ../input/ncbi and its subdirectories
process_fna_files() {
    local base_dir="../input/ncbi"
    local busco_db_path="../arthropoda_odb10"

    # Find all .fna files in subdirectories of the base directory
    find "$base_dir" -type f -name "*.fna" | while read -r fna_file; do
        # The subdirectory name is assumed to be the species name
        species_name=$(basename "$(dirname "$fna_file")")
        # Timestamp (YYYY-MM-DD_HH-MM-SS) to keep runs distinct
        timestamp=$(date +"%Y-%m-%d_%H-%M-%S")
        # Per-species output directory
        output_dir="./output/${species_name}_$timestamp"
        mkdir -p "$output_dir"
        # Run BUSCO in genome mode against the local arthropoda_odb10 lineage, offline
        busco -i "$fna_file" -l "$busco_db_path" -o "$output_dir" -m genome --cpu 20 --offline -f
    done
}

process_fna_files
A better option is to submit a separate SLURM job for each genome. Using a for loop inside a single SLURM job is not efficient. Instead, do something like the following (pseudo-code, will not run as is). This will submit a job for every genome. Depending on the allocation allowed for your account, a certain number of jobs will start and the rest will pend. Once a job finishes, a new one is pulled out of the pending jobs until all complete.

Thanks a lot!
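A minimal sketch of the per-genome submission approach described above, assuming the paths from the question and sbatch's --wrap option; the resource values mirror the original header, and the SBATCH_CMD override is a hypothetical hook so the loop can be dry-run with echo before submitting anything:

```shell
#!/bin/bash
# Sketch: one SLURM job per genome instead of a loop inside a single job.
# Paths default to those in the question; SBATCH_CMD is a hypothetical
# override so you can dry-run with SBATCH_CMD=echo.

submit_busco_jobs() {
    local base_dir="${1:-../input/ncbi}"
    local busco_db_path="${2:-../arthropoda_odb10}"

    find "$base_dir" -type f -name "*.fna" 2>/dev/null | while read -r fna_file; do
        species_name=$(basename "$(dirname "$fna_file")")
        # Each genome gets its own job, name, and log file
        ${SBATCH_CMD:-sbatch} \
            --partition=node --cpus-per-task=20 --mem=32G \
            --job-name="busco_${species_name}" \
            --output="busco_${species_name}-%j.txt" \
            --wrap="busco -i '$fna_file' -l '$busco_db_path' -o 'busco_${species_name}' -m genome --cpu 20 --offline -f"
    done
}
```

Running it once with SBATCH_CMD=echo prints the generated sbatch commands so they can be inspected before real submission.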
This is not a direct answer to your original question, but rather some advice.
I can't look through a tree that has more than 150-200 entries because it becomes too complex. It can't fit on a page or a computer screen. Do you really need 1000 entries? It is a safe assumption that many genomes in your list have average nucleotide identity that is >=95%. If you have some strains of the same species, they are probably >=99% identical. All those entries are guaranteed to be next to each other in the tree, so it really doesn't matter whether you build a tree to prove it or not. I suggest you narrow down your list to truly different genomes, and it probably won't be so challenging.
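The shortlisting step above can be sketched from an all-vs-all ANI table with one "query reference ANI" triple per line, the format tools like fastANI emit. The greedy keep-the-lexicographically-smaller rule and the 95% cutoff are illustrative assumptions, not part of the answer, and this is not true clustering, just a quick filter:

```shell
#!/bin/bash
# Sketch: greedy dereplication from a three-column ANI table
# (query  reference  ANI). For each pair at or above the cutoff,
# keep the lexicographically smaller name and drop the other.

dereplicate() {
    local ani_table="$1" cutoff="${2:-95}"
    awk -v t="$cutoff" '
        { seen[$1]; seen[$2] }            # record every genome name
        $1 != $2 && $3 >= t {             # near-identical pair
            if ($1 < $2) drop[$2]; else drop[$1]
        }
        END {
            for (g in seen) if (!(g in drop)) print g
        }' "$ani_table" | sort
}
```

For example, given pairs A-B at 97 and A-C at 80, only B is dropped and A and C remain as representatives.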
Once you build a smaller reference tree, additional sequences can be added to it using pplacer or a similar tool: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-538
I will use the 1600+ species tree as a supplementary graph. The actual graph will have some branches collapsed for similar species, as I want to highlight clades with specific miRNA changes. The pplacer tool looks interesting, and I will definitely look into it!