Cromwell slurm singularity call caching
asmariyaz23 · 8 months ago

Hello

I have been using Singularity within my Cromwell conf and now want to enable call caching. I do not have internet access on any of the cluster nodes. However, call caching still shows as turned off. Is it because Cromwell cannot pull the Docker image (there is no internet on any of the nodes)?
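
As far as I can tell, the per-call callCaching section of the workflow metadata is where the effective caching mode shows up, so a query along these lines against the Cromwell server (placeholder workflow ID, default server port) should show it:

# <workflow-id> is a placeholder; 8000 is Cromwell's default server port
curl -s "http://localhost:8000/api/workflows/v1/<workflow-id>/metadata?includeKey=callCaching"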

Here is the cromwell.conf.

# This line is required. It pulls in default overrides from the embedded cromwell
# `reference.conf` (in core/src/main/resources) needed for proper performance of cromwell.
include required(classpath("application"))

# Cromwell HTTP server settings
webservice {
  #port = 8000
  #interface = 0.0.0.0
  #binding-timeout = 5s
  #instance.name = "reference"
}

# Cromwell "system" settings
system {
  # If 'true', a SIGINT will trigger Cromwell to attempt to abort all currently running jobs before exiting
  #abort-jobs-on-terminate = false

  # If 'true', a SIGTERM or SIGINT will trigger Cromwell to attempt to gracefully shutdown in server mode,
  # in particular clearing up all queued database writes before letting the JVM shut down.
  # The shutdown is a multi-phase process, each phase having its own configurable timeout. See the Dev Wiki for more details.
  #graceful-server-shutdown = true

  # Cromwell will cap the number of running workflows at N
  #max-concurrent-workflows = 5000

  # Cromwell will launch up to N submitted workflows at a time, regardless of how many open workflow slots exist
  #max-workflow-launch-count = 50

  # Number of seconds between workflow launches
  #new-workflow-poll-rate = 20

  # Since the WorkflowLogCopyRouter is initialized in code, this is the number of workers
  #number-of-workflow-log-copy-workers = 10

  # Default number of cache read workers
  #number-of-cache-read-workers = 25

  io {
    # throttle {
    # # Global Throttling - This is mostly useful for GCS and can be adjusted to match
    # # the quota available on the GCS API
    # #number-of-requests = 100000
    # #per = 100 seconds
    # }

    # Number of times an I/O operation should be attempted before giving up and failing it.
    #number-of-attempts = 5
  }

  # Maximum number of input file bytes allowed in order to read each type.
  # If exceeded a FileSizeTooBig exception will be thrown.
  input-read-limits {
    #lines = 128000
    #bool = 7
    #int = 19
    #float = 50
    #string = 128000
    #json = 128000
    #tsv = 128000
    #map = 128000
    #object = 128000
  }

  abort {
    # These are the default values in Cromwell, in most circumstances there should not be a need to change them.

    # How frequently Cromwell should scan for aborts.
    scan-frequency: 30 seconds

    # The cache of in-progress aborts. Cromwell will add entries to this cache once a WorkflowActor has been messaged to abort.
    # If on the next scan an 'Aborting' status is found for a workflow that has an entry in this cache, Cromwell will not ask
    # the associated WorkflowActor to abort again.
    cache {
      enabled: true
      # Guava cache concurrency.
      concurrency: 1
      # How long entries in the cache should live from the time they are added to the cache.
      ttl: 20 minutes
      # Maximum number of entries in the cache.
      size: 100000
    }
  }

  # Cromwell reads this value into the JVM's `networkaddress.cache.ttl` setting to control DNS cache expiration
  dns-cache-ttl: 3 minutes
}

docker {
  hash-lookup {
    # Set this to match your available quota against the Google Container Engine API
    #gcr-api-queries-per-100-seconds = 1000

    # Time in minutes before an entry expires from the docker hashes cache and needs to be fetched again
    #cache-entry-ttl = "20 minutes"

    # Maximum number of elements to be kept in the cache. If the limit is reached, old elements will be removed from the cache
    #cache-size = 200

    # How should docker hashes be looked up. Possible values are "local" and "remote"
    # "local": Lookup hashes on the local docker daemon using the cli
    # "remote": Lookup hashes on docker hub, gcr, gar, quay
    #method = "remote"
    enabled = "false"
  }
}

# Here is where you can define the backend providers that Cromwell understands.
# The default is a local provider.
# To add additional backend providers, you should copy paste additional backends
# of interest that you can find in the cromwell.example.backends folder
# folder at https://www.github.com/broadinstitute/cromwell
# Other backend providers include SGE, SLURM, Docker, udocker, Singularity, etc.
# Don't forget you will need to customize them for your particular use case.
backend {
  # Override the default backend.
  default = slurm

  # The list of providers.
  providers {
    # Copy paste the contents of a backend provider in this section
    # Examples in cromwell.example.backends include:
    # LocalExample: What you should use if you want to define a new backend provider
    # AWS: Amazon Web Services
    # BCS: Alibaba Cloud Batch Compute
    # TES: protocol defined by GA4GH
    # TESK: the same, with kubernetes support
    # Google Pipelines, v2 (PAPIv2)
    # Docker
    # Singularity: a container safe for HPC
    # Singularity+Slurm: and an example on Slurm
    # udocker: another rootless container solution
    # udocker+slurm: also exemplified on slurm
    # HTCondor: workload manager at UW-Madison
    # LSF: the Platform Load Sharing Facility backend
    # SGE: Sun Grid Engine
    # SLURM: workload manager

    # Note that these other backend examples will need tweaking and configuration.
    # Please open an issue https://www.github.com/broadinstitute/cromwell if you have any questions
    slurm {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        # Root directory where Cromwell writes job results in the container. This value
        # can be used to specify where the execution folder is mounted in the container.
        # It is used for the construction of the docker_cwd string in the submit-docker
        # value below.
        dockerRoot = "/cromwell-executions"

        concurrent-job-limit = 10
        # If an 'exit-code-timeout-seconds' value is specified:
        #     - check-alive will be run at this interval for every job
        #     - if a job is found to be not alive, and no RC file appears after this interval,
        #       then it will be marked as Failed.
        ## Warning: If set, Cromwell will run 'check-alive' for every job at this interval
        exit-code-timeout-seconds = 360
        filesystems {
          local {
            localization: [
              # soft link does not work for docker with --contain. Hard links won't work
              # across file systems
              "copy", "hard-link", "soft-link"
            ]
            caching {
              duplication-strategy: ["copy", "hard-link", "soft-link"]
              hashing-strategy: "xxh64"
            }
          }
        }

        runtime-attributes = """
        Int runtime_minutes = 600
        Int cpus = 2
        Int requested_memory_mb_per_core = 30000
        String? docker
        String? account
        String? IMAGE
        """

        submit = """
            sbatch \
              --wait \
              --job-name=${job_name} \
              --chdir=${cwd} \
              --output=${out} \
              --error=${err} \
              --time=${runtime_minutes} \
              ${"--cpus-per-task=" + cpus} \
              --mem-per-cpu=${requested_memory_mb_per_core} \
              --account=${account} \
              --wrap "/bin/bash ${script}"
        """

        submit-docker = """
            # SINGULARITY_CACHEDIR needs to point to a directory accessible by
            # the jobs (i.e. not lscratch). Might want to use a workflow local
            # cache dir like in run.sh
            source /scratch/asmab/set_singularity_cachedir.sh
            SINGULARITY_CACHEDIR=/scratch/asmab/singularity-cache
            echo "SINGULARITY_CACHEDIR $SINGULARITY_CACHEDIR"
            if [ -z "$SINGULARITY_CACHEDIR" ]; then
                CACHE_DIR=$HOME/.singularity
            else
                CACHE_DIR=$SINGULARITY_CACHEDIR
            fi
            mkdir -p "$CACHE_DIR"
            LOCK_FILE=$CACHE_DIR/singularity_pull_flock

            # we want to avoid all the cromwell tasks hammering each other trying
            # to pull the container into the cache for the first time. flock works
            # on GPFS, netapp, and vast (of course only for processes on the same
            # machine which is the case here since we're pulling it in the master
            # process before submitting).
            #flock --exclusive --timeout 1200 $LOCK_FILE \
            #    singularity exec --containall docker://${docker} \
            #    echo "successfully pulled ${docker}!" &> /dev/null

            # Ensure singularity is loaded if it's installed as a module
            #module load Singularity/3.0.1
            module load apptainer/1.2.4
            # Build the Docker image into a singularity image
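            # Note: if the image is not already in the local Apptainer/Singularity cache,
            # this build pulls layers from the registry, so it needs network access from
            # wherever this submit-docker snippet actually runs.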
            IMAGE=${docker}.sif
            apptainer build $IMAGE docker://${docker}

            # Submit the script to SLURM
            sbatch \
              --wait \
              --job-name=${job_name} \
              --chdir=${cwd} \
              --output=${cwd}/execution/stdout \
              --error=${cwd}/execution/stderr \
              --time=${runtime_minutes} \
              ${"--cpus-per-task=" + cpus} \
              --mem-per-cpu=${requested_memory_mb_per_core} \
              --account=${account} \
              --wrap "apptainer exec --containall --bind ${cwd}:${docker_cwd} ${IMAGE} ${job_shell} ${docker_script}"
        """

        kill = "scancel ${job_id}"
        check-alive = "squeue -j ${job_id}"
        job-id-regex = "Submitted batch job (\\d+).*"
      }
    }

  }
}

call-caching {
  enabled = true
  invalidate-bad-cache-results = true
}
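
For reference, the hash-lookup comments in the config mention a "local" method that resolves image hashes through the local docker CLI instead of querying a registry over the network. I have not tried it (I am not sure a docker daemon is even available on our submit host), but the variant I had in mind is roughly:

docker {
  hash-lookup {
    enabled = true
    # "local" looks up hashes on the local docker daemon via the CLI instead of
    # docker hub / gcr / quay (per the method comment above). Untested on my side.
    method = "local"
  }
}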
cromwell singularity wdl