Dear reader,
I have nanopore genome sequences of fungal samples that I am trying to assemble using Canu. Details: Canu version 2.2 on an HPC cluster. Node resources:
- cores 48 (logical cpu 96)
- RAM 256G
My script looks like this:
#!/bin/bash
#SBATCH --job-name=Canu_nanopore_assembly
#SBATCH --partition=standard
#SBATCH --output=canu_assembly.out.%j
#SBATCH --error=canu_assembly.err.%j
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=48
#SBATCH --hint=nomultithread
#SBATCH --mem=240G
# Collect Inputs
type=$1
RawReads=$2
prefix=$3
OutDir=$4
size=$5
Threads=$SLURM_CPUS_PER_TASK
results=$OutDir/$prefix; mkdir -p $OutDir/$prefix
# Run Canu
time canu -${type} $RawReads -p $prefix -d $results/ genomeSize=$size -maxThreads=$Threads useGrid=false -utgReAlign=true -overlapper=mhap stopOnLowCoverage=5
exit
and I run this script with the following command:
# FORMAT => sbatch $CANU $read_type $fasta $prefix $canu_out $genomeSIZE
sbatch canu_test.sh nanopore NP01.fastq NP01 canu_out 60m
The genome assembly took about 12 hours. When I checked my job efficiency with the command seff <jobID>, I got the following results:
Job ID: 2007928
Cluster: drag
User/Group: $USER
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 96
CPU Utilized: 14-16:11:51
CPU Efficiency: 28.97% of 50-15:32:48 core-walltime
Job Wall-clock time: 12:39:43
Memory Utilized: 40.04 GB
Memory Efficiency: 16.68% of 240.00 GB
I am using the useGrid=false option in Canu. As I understand it, Canu will then treat the node as a local workstation. This is what I got in the log file:
-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_362' (from 'java') with -d64 support.
-- Detected gnuplot version '5.2 patchlevel 4 ' (from 'gnuplot') and image format 'png'.
--
-- Detected 48 CPUs and 245760 gigabytes of memory on the local machine.
--
-- Detected Slurm with 'sinfo' binary in /usr/bin/sinfo.
-- Slurm disabled by useGrid=false
--
-- Local machine mode enabled; grid support not detected or not allowed.
--
-- Job limits:
-- 48 CPUs (maxThreads option).
Complete Log file here. Canu_log_file
Then I tried again after changing some parameters in the Canu command. The new command is:
time canu -${type} $RawReads -p $prefix -d $results/ genomeSize=$size -maxThreads=$Threads --maxMemory=$SLURM_MEM_PER_NODE useGrid=true
I saw the following in the canu log file
-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '22.0.1-internal' (from '/vast/user/tools/miniconda3/envs/canu/lib/jvm/bin/java') without -d64 support.
-- Detected gnuplot version '5.2 patchlevel 4 ' (from 'gnuplot') and image format 'png'.
--
-- Detected 48 CPUs and 245760 gigabytes of memory on the local machine.
--
-- Detected Slurm with 'sinfo' binary in /usr/bin/sinfo.
-- Detected Slurm with task IDs up to 1000 allowed.
--
-- Slurm support detected. Resources available:
-- 1 host with 80 cores and 753 GB memory.
-- 1 host with 256 cores and 502 GB memory.
-- 1 host with 144 cores and 4029 GB memory.
-- 92 hosts with 96 cores and 250 GB memory.
-- 4 hosts with 128 cores and 2267 GB memory.
-- 8 hosts with 96 cores and 375 GB memory.
-- 3 hosts with 64 cores and 250 GB memory.
-- 10 hosts with 64 cores and 502 GB memory.
--
-- Job limits:
-- 245760 gigabytes memory (maxMemory option).
-- 48 CPUs (maxThreads option).
I canceled this job, worried that it was somehow using the whole cluster (I hope I am wrong here).
This is the first script I have ever run on a cluster; before that I did all my work on a workstation. If I get the same run time per sample, then why use a cluster?
- Which options can I change in Canu so that it assembles the genome faster and uses more of the available resources?
- Should I set useGrid=true? If so, will all execution stay within the one specified node?
I have to assemble 12 samples, and if each one takes a little over 12 hours, that is more than 6 days in total.
Your help is greatly appreciated.
Using a cluster does not necessarily mean you will be done sooner. If your local workstation had similar resources (e.g. N cores and M memory), the assembly will take about the same amount of time on the cluster. The advantage of the cluster is that you can start more than one job at a time, within the resource limits set for your account (a well-administered cluster will not allow a single user to take up all of its resources). So you could run several assemblies in parallel, whereas on the workstation you could run only one at a time because a single assembly maxes out its resources.
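To make the "more than one job" point concrete, here is a minimal sketch of submitting one assembly per sample with the script from the question. The sample names NP01 to NP12, the matching .fastq file names, and the DRY_RUN switch are assumptions made for illustration; only canu_test.sh and its argument order come from the original post. By default the sketch only prints the sbatch commands so they can be checked first.

```shell
#!/bin/bash
# Submit one Canu job per sample so the scheduler can run them side by side.
# ASSUMPTION: samples are named NP01..NP12 and reads live in <sample>.fastq.
# DRY_RUN is a hypothetical switch: 1 (default) prints commands, 0 submits them.
DRY_RUN=${DRY_RUN:-1}

submit_sample() {
    local sample=$1
    # Same argument order as in the question:
    # sbatch $CANU $read_type $fasta $prefix $canu_out $genomeSIZE
    local cmd=(sbatch canu_test.sh nanopore "${sample}.fastq" "$sample" canu_out 60m)
    if [[ $DRY_RUN == 1 ]]; then
        echo "${cmd[*]}"
    else
        "${cmd[@]}"
    fi
}

for (( n = 1; n <= 12; n++ )); do
    printf -v sample 'NP%02d' "$n"   # NP01, NP02, ..., NP12
    submit_sample "$sample"
done
```

How many of these jobs actually run at the same time depends on the limits of your account and on how much each job requests. Since the first run used about 40 GB of the 240 GB it asked for, a smaller --mem request might let the scheduler start more assemblies at once, but that is something to test rather than assume.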
Looking at the job efficiency report above, you are clearly not maxing out the resources currently allocated, so there may be no obvious bottleneck. Programs have to take the time they need to do their work. One thing you could do is target your jobs (using the appropriate partition) at nodes that have more cores or memory, if you are allowed to run jobs there. Adding more cores may not always help, though, since the algorithms involved may not parallelize well and not every step of the assembly can run in parallel.
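For reference, the efficiency figure can be recomputed from the seff output posted in the question. The sketch below uses only the numbers shown there; the helper names to_seconds, efficiency, and fmt_pct are made up for this example. One thing it makes visible is that seff divides by 96 cores, not the 48 that Canu was told to use, presumably because the allocation counts both hardware threads of each physical core.

```shell
#!/bin/bash
# Convert a Slurm-style duration ([D-]HH:MM:SS) to seconds.
to_seconds() {
    local t=$1 days=0 h m s
    if [[ $t == *-* ]]; then
        days=${t%%-*}   # part before the dash
        t=${t#*-}       # part after the dash
    fi
    IFS=: read -r h m s <<< "$t"
    # 10# forces base 10 so that values such as "08" are not read as octal
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# CPU efficiency in hundredths of a percent (integer arithmetic, truncated).
# usage: efficiency <cpu_seconds> <wall_seconds> <cores>
efficiency() {
    echo $(( $1 * 10000 / ($2 * $3) ))
}

# Format hundredths of a percent as e.g. 12.34%
fmt_pct() {
    printf '%d.%02d%%' $(( $1 / 100 )) $(( $1 % 100 ))
}

cpu_used=$(to_seconds "14-16:11:51")   # "CPU Utilized" from seff
wall=$(to_seconds "12:39:43")          # "Job Wall-clock time" from seff

echo "against 96 cores: $(fmt_pct "$(efficiency "$cpu_used" "$wall" 96)")"
echo "against 48 cores: $(fmt_pct "$(efficiency "$cpu_used" "$wall" 48)")"
```

Against 96 cores this reproduces the 28.97% that seff reports. Against the 48 threads given to Canu via maxThreads, the same CPU time works out to a bit under 58%, so the run was busier than the headline number suggests, though still far from saturating the node.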
Thank you for your response.
Can you also clarify the usage of the useGrid=true/false option?
useGrid=true will allow Canu to split the assembly into parallel steps that are each submitted as separate jobs. In my experience, this greatly increases the speed at which the assembly completes. That said, getting a Canu assembly finished in 12 hours seems really good to me. When I was assembling repetitive plant genomes, it took weeks on a single workstation.
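As a rough illustration of a grid-mode launch, here is a hedged sketch. The partition name and time limit passed through gridOptions are placeholders, and the sample paths are taken from the example command in the question. It also converts SLURM_MEM_PER_NODE from megabytes to gigabytes before handing it to maxMemory, since the log in the question shows the raw value being read as "245760 gigabytes". The sketch only builds and prints the command so it can be reviewed before anything is submitted.

```shell
#!/bin/bash
# Sketch: build a Canu grid-mode command from inside a small driver job.
# ASSUMPTIONS: the partition and time limit in gridOptions are placeholders;
# the fallback memory value below is for illustration only.

mem_mb=${SLURM_MEM_PER_NODE:-245760}   # Slurm reports this value in megabytes
mem_gb=$(( mem_mb / 1024 ))            # Canu's maxMemory expects gigabytes

canu_cmd=(
    canu -nanopore NP01.fastq -p NP01 -d canu_out/NP01
    genomeSize=60m
    useGrid=true
    gridOptions="--partition=standard --time=24:00:00"
    maxThreads=48
    maxMemory="$mem_gb"
)

# Print the command, shell-quoted, for review.
# Run "${canu_cmd[@]}" yourself once it looks right.
printf '%q ' "${canu_cmd[@]}"
echo
```

As far as I understand, with useGrid=true the individual steps are ordinary Slurm jobs, so the scheduler decides where they run and they are not confined to the node the first job started on. The "Resources available" list in the log is what Canu detected while planning its jobs, not something it has reserved, and maxThreads and maxMemory cap what any single job requests.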