Question

Canu Nanopore Genome assembly optimization

4 months ago

Umer ▴ 160

Dear reader,

I have Nanopore genome sequence data from fungal samples, which I am trying to assemble with Canu. The details are below.

Canu version: 2.2
HPC cluster node resources:

cores 48 (logical cpu 96)
RAM 256G

My script looks like this:

#!/bin/bash
#SBATCH --job-name=Canu_nanopore_assembly
#SBATCH --partition=standard
#SBATCH --output=canu_assembly.out.%j
#SBATCH --error=canu_assembly.err.%j
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=48
#SBATCH --hint=nomultithread
#SBATCH --mem=240G 

# Collect Inputs
type=$1
RawReads=$2
prefix=$3
OutDir=$4
size=$5
Threads=$SLURM_CPUS_PER_TASK
results=$OutDir/$prefix; mkdir -p $OutDir/$prefix

# Run Canu
time canu -${type} $RawReads -p $prefix -d $results/ genomeSize=$size -maxThreads=$Threads useGrid=false -utgReAlign=true -overlapper=mhap stopOnLowCoverage=5

exit

and I run this script with the command:

# FORMAT => sbatch $CANU $read_type $fasta $prefix $canu_out $genomeSIZE
sbatch canu_test.sh nanopore NP01.fastq NP01 canu_out 60m

The genome assembly took a little over 12 hours. When I checked the job efficiency with seff <jobID>, I got the following results:

Job ID: 2007928
Cluster: drag
User/Group: $USER
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 96
CPU Utilized: 14-16:11:51
CPU Efficiency: 28.97% of 50-15:32:48 core-walltime
Job Wall-clock time: 12:39:43
Memory Utilized: 40.04 GB
Memory Efficiency: 16.68% of 240.00 GB
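
For reference, the CPU efficiency figure can be reproduced by hand from the numbers above. The sketch below is illustrative only: `to_seconds` and `efficiency` are hypothetical helper names, not part of Slurm or seff. It recomputes the percentage against the 96 cores that seff reports and against the 48 CPUs requested with `--cpus-per-task=48`.

```bash
#!/bin/sh
# Convert a Slurm-style duration ("D-HH:MM:SS" or "HH:MM:SS") to seconds.
to_seconds() {
    echo "$1" | awk -F'[-:]' '
        NF == 4 { print $1 * 86400 + $2 * 3600 + $3 * 60 + $4 }
        NF == 3 { print $1 * 3600 + $2 * 60 + $3 }'
}

# Percentage of available core time actually used.
# Arguments: CPU seconds used, wall-clock seconds, number of cores.
efficiency() {
    awk -v c="$1" -v w="$2" -v n="$3" \
        'BEGIN { printf "%.2f\n", 100 * c / (w * n) }'
}

cpu=$(to_seconds 14-16:11:51)   # "CPU Utilized" from seff
wall=$(to_seconds 12:39:43)     # "Job Wall-clock time" from seff

echo "against 96 cores: $(efficiency "$cpu" "$wall" 96)%"   # 28.97%
echo "against 48 cores: $(efficiency "$cpu" "$wall" 48)%"   # 57.95%
```

If seff is counting all 96 logical CPUs of the node while the job was limited to 48, the 28.97% figure understates how busy the requested CPUs were. Measured against 48 CPUs, the same CPU time comes to roughly 58%.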

I am using the useGrid=false option in Canu. As I understand it, this makes Canu treat the node as a local workstation. This is what I got in the log file:

-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_362' (from 'java') with -d64 support.
-- Detected gnuplot version '5.2 patchlevel 4   ' (from 'gnuplot') and image format 'png'.
--
-- Detected 48 CPUs and 245760 gigabytes of memory on the local machine.
--
-- Detected Slurm with 'sinfo' binary in /usr/bin/sinfo.
--          Slurm disabled by useGrid=false
--
-- Local machine mode enabled; grid support not detected or not allowed.
--
-- Job limits:
--     48 CPUs              (maxThreads option).

The complete log file is here: Canu_log_file

Then I tried again after changing some parameters in the Canu command. The new command is:

time canu -${type} $RawReads -p $prefix -d $results/ genomeSize=$size -maxThreads=$Threads --maxMemory=$SLURM_MEM_PER_NODE useGrid=true

I saw the following in the canu log file

-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '22.0.1-internal' (from '/vast/user/tools/miniconda3/envs/canu/lib/jvm/bin/java') without -d64 support.
-- Detected gnuplot version '5.2 patchlevel 4   ' (from 'gnuplot') and image format 'png'.
--
-- Detected 48 CPUs and 245760 gigabytes of memory on the local machine.
--
-- Detected Slurm with 'sinfo' binary in /usr/bin/sinfo.
-- Detected Slurm with task IDs up to 1000 allowed.
-- 
-- Slurm support detected.  Resources available:
--      1 host  with  80 cores and  753 GB memory.
--      1 host  with 256 cores and  502 GB memory.
--      1 host  with 144 cores and 4029 GB memory.
--     92 hosts with  96 cores and  250 GB memory.
--      4 hosts with 128 cores and 2267 GB memory.
--      8 hosts with  96 cores and  375 GB memory.
--      3 hosts with  64 cores and  250 GB memory.
--     10 hosts with  64 cores and  502 GB memory.
--
-- Job limits:
--   245760 gigabytes memory  (maxMemory option).
--     48 CPUs              (maxThreads option).
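
A note on the memory figure in this log: 245760 is exactly the `--mem=240G` request expressed in megabytes (240 × 1024), yet Canu labels it as gigabytes. Slurm's `SLURM_MEM_PER_NODE` is, as far as I know, given in megabytes, while Canu's `maxMemory` is interpreted in gigabytes, so passing the variable straight through probably overstates the limit by a factor of 1024. A minimal sketch of the conversion, assuming that unit mismatch is what happened here (`mb_to_gb` is a hypothetical helper, and the fallback value exists only so the snippet runs outside a Slurm job):

```bash
#!/bin/sh
# Convert megabytes to whole gigabytes (integer division).
mb_to_gb() {
    echo $(( $1 / 1024 ))
}

# Inside a Slurm job this variable is set by the scheduler;
# 245760 is the value seen in the log above, used as a fallback.
mem_mb=${SLURM_MEM_PER_NODE:-245760}
mem_gb=$(mb_to_gb "$mem_mb")

# Print the option as it could be passed to canu.
echo "maxMemory=${mem_gb}"
```

With the value from the log, this prints `maxMemory=240`, which matches the memory actually requested for the job.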

I canceled this job because I was worried that it was somehow using the whole cluster (I hope I am wrong here).

This is the first script I have ever run on a cluster; until now I did all my work on a workstation. If one sample takes the same amount of time either way, why use a cluster?

Can you advise which Canu options I could change so that the assembly runs faster and makes better use of the allocated resources?
Should I set useGrid=true, and if so, will all execution stay within the one specified node?

I have 12 samples to assemble. At a little over 12 hours each, running them one after another would take more than 6 days.

Your help is greatly appreciated.

slurm assembly genome canu nanopore • 469 views

ADD COMMENT • link updated 4 months ago by Dave Carlson ★ 2.1k • written 4 months ago by Umer ▴ 160

Using a cluster does not necessarily mean you are going to be done sooner. If your local workstation had similar resources (e.g. N cores and M memory), the job is going to take about the same amount of time on the cluster. The advantage of the cluster is that you can start more than one job at once, within the limits of the resources allowed for your account (a well-administered cluster will not let a single user take up all of its resources). So you could run several assemblies in parallel, whereas on the workstation you can run only one at a time because it maxes out the resources.
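
To make the point about parallel jobs concrete, here is a minimal sketch that submits one job per sample using the script and arguments from the question. Everything beyond `NP01.fastq` is assumed: the other file names are hypothetical, the genome size is taken to be 60m for every sample, and `submit_all` is an illustrative helper. It prints the sbatch commands instead of running them; drop the `echo` to submit for real.

```bash
#!/bin/sh
# Print one sbatch command per FASTQ file, deriving the prefix
# from the file name (NP01.fastq -> NP01).
submit_all() {
    for fastq in "$@"; do
        prefix=$(basename "$fastq" .fastq)
        # Dry run: remove "echo" to actually submit the job.
        echo sbatch canu_test.sh nanopore "$fastq" "$prefix" canu_out 60m
    done
}

# Hypothetical sample list; replace with the real 12 files.
submit_all NP01.fastq NP02.fastq NP03.fastq
```

How many of these jobs run at the same time depends on the per-user limits of the cluster, so the total time may fall anywhere between one assembly's worth and the fully sequential case.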

Looking at the following

CPU Efficiency: 28.97% of 50-15:32:48 core-walltime

Memory Efficiency: 16.68% of 240.00 GB

You are clearly not maxing out the resources currently allocated, so there may be no obvious bottleneck. Programs take as long as they need to do their work. One thing you could do is target your jobs (using the correct partition) at nodes that have more cores or memory, if you are allowed to run jobs there. Adding more cores may not always help, since the algorithms may not be trivially parallelizable and not all steps in the assembly can run in parallel.

ADD REPLY • link 4 months ago by GenoMax 148k

Thank you for your response.

Can you also clarify how the useGrid=true/false option works?

ADD REPLY • link 4 months ago by Umer ▴ 160

useGrid=true will allow Canu to split up the assembly processes into parallel steps that are each submitted as different jobs. In my experience, this greatly increases the speed at which the assembly completes. That said, getting a Canu assembly finished in 12 hours seems really good to me. When I was assembling repetitive plant genomes, it took weeks on a single workstation.
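
As a rough sketch of what a grid-enabled run might look like for the sample in the question: with useGrid=true, Canu submits its stages to Slurm itself, so those jobs can land on whichever nodes the scheduler picks rather than staying on one node. `gridOptions` is Canu's parameter for passing extra flags to the scheduler. The partition name is taken from the question's script, and whether it is the right choice on that cluster is an assumption; `build_canu_cmd` is a hypothetical helper. The snippet only prints the command so it can be checked before running.

```bash
#!/bin/sh
# Build (but do not run) a grid-enabled canu command line.
# $1 = prefix, $2 = reads file, $3 = genome size, $4 = Slurm partition
build_canu_cmd() {
    printf 'canu -p %s -d canu_out/%s genomeSize=%s useGrid=true gridOptions="--partition=%s" -nanopore %s\n' \
        "$1" "$1" "$3" "$4" "$2"
}

# Values taken from the question; adjust per sample.
build_canu_cmd NP01 NP01.fastq 60m standard
```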

ADD REPLY • link 4 months ago by Dave Carlson ★ 2.1k