(Experienced WDL programmers may find the question below too simple to be real. If so, let me stress that I am just now getting started with WDL, and that I really don't know how one would tackle problems like the one described below using it.)
Just as an example, suppose I have a Python script /path/to/foo.py
that takes, say, one command-line argument, and, upon successful completion, writes out a file to disk(1).
I want to invoke this script N times. Since I have at my disposal M cores (where N ≫ M > 1), I would like to run approximately N/M of these invocations on each one of these cores.
This general problem structure is what I am here calling "embarrassingly parallel".
If I were to accomplish this task using, say, an LSF cluster, I would create M one-time shell scripts, each containing, one after the other, the command lines for approximately N/M of the desired invocations of /path/to/foo.py, each with its corresponding argument. I would then submit these M one-time scripts as separate (and hopefully concurrently-running) LSF jobs. Each of these M jobs would then run its share of roughly N/M invocations of /path/to/foo.py sequentially, while, ideally, the M jobs themselves would run in parallel with each other.
My question is: how does one implement the same workflow using a WDL (to be fed to Cromwell) instead of an LSF cluster?
More specifically, what is the general structure of a WDL that implements such embarrassing parallelization?
(1) In case it matters, we may assume that the name of this file depends on the argument, so that runs of the script with different arguments will result in differently-named files.
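For concreteness, the construct that expresses this in WDL is the scatter block: the workflow scatters over an array of arguments, and the engine (Cromwell, here) turns each iteration into an independent job. Below is a minimal sketch; the task name, the args input, and the result_<arg>.txt output naming scheme are assumptions made for illustration, not anything prescribed by WDL itself.

```wdl
version 1.0

task run_foo {
  input {
    String arg
  }
  command <<<
    python /path/to/foo.py ~{arg}
  >>>
  output {
    # hypothetical naming scheme: foo.py is assumed to write result_<arg>.txt
    File result = "result_~{arg}.txt"
  }
}

workflow run_all {
  input {
    Array[String] args   # the N arguments, one per desired invocation
  }
  # each iteration of this scatter becomes an independent, schedulable job
  scatter (a in args) {
    call run_foo { input: arg = a }
  }
  output {
    Array[File] results = run_foo.result   # gathered automatically
  }
}
```

Note that nothing in this sketch says how the scattered jobs map onto cores; that mapping is left to whatever backend Cromwell is configured with.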
Thank you! What I'm not yet clear on is the following: if I need to run, say, ten billion small, independent tasks that can, in principle, all be run in parallel, will Cromwell run ten billion independent jobs (somehow), processing each task in its own job, or will it split the ten billion tasks into, say, 100 subsets, and process these subsets in 100 parallel jobs?
What I'm trying to get at here is that there is overhead in spinning up a core to carry out some work, and it would therefore be wasteful to do so for one short(1) task. It would instead make more sense to use such a core to process a whole batch (say 100 million) of those small tasks serially. I wonder whether engines like Cromwell already implement such batching strategies on their own, so that one need only tell them what needs to be done at the most granular level.
(1) In fact, the overhead of spinning a core up and down is the yardstick I have in mind when I say that a task is "short."
I have no knowledge about WDL and its executors, but 10 billion tasks sounds to me like a problem that you should indeed implement differently, in several ways:
It will be next to impossible to write 10 billion files to disk: ext4 and NTFS have a limit of 2^32 - 1 files (4,294,967,295), and FAT32 just 2^16 - 1 (65,535). You could use an object store, but it would be better to write the records into one file (Apache Parquet?) or a database.
For processing this many records separately, I fear no workflow executor is suited: neither WDL nor Nextflow nor Snakemake. They are meant to parallelize dozens to hundreds of CPU-heavy tasks, e.g. on a per-sample basis, but not billions of single records like individual reads. For this, you should use either columnar data-frame tools like Pandas or Polars, or a stream-processing tool like Apache Flink.
Actually, in reply to my own comment, I now realize that the question it poses is silly, and not only because, as Matthias Zepper pointed out, the numbers I used by way of example are a bit too ridiculous. It is also silly because Cromwell has no way of knowing that the tasks I'm describing as "small" are indeed small, even if one can make this term reasonably objective (to human readers) by invoking the overhead of spinning a node up and down as a yardstick. In other words, it is simply unreasonable to expect that Cromwell could determine a minimum batch size that would be worth spinning up a core for. Therefore, there is no reasonable way for Cromwell to automatically implement the batching strategy I described in my original post.
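The batching, then, has to be expressed explicitly in the WDL itself: scatter over batches of arguments rather than over individual arguments, so that each job runs its share serially. Here is a rough sketch of this, assuming the argument list has been pre-chunked into roughly N/M batches when generating the inputs JSON; the batches input, the loop body, and the result_*.txt naming scheme are illustrative assumptions.

```wdl
version 1.0

task run_batch {
  input {
    Array[String] batch   # one batch of arguments, to be processed serially
  }
  command <<<
    # run this shard's invocations one after another on a single node
    # (assumes the arguments contain no whitespace)
    for a in ~{sep=" " batch}; do
      python /path/to/foo.py "$a"
    done
  >>>
  output {
    # hypothetical naming scheme, as in footnote (1) of the original post
    Array[File] results = glob("result_*.txt")
  }
}

workflow run_batched {
  input {
    Array[Array[String]] batches   # N arguments pre-chunked into ~N/M batches
  }
  # one job per batch, rather than one job per invocation
  scatter (b in batches) {
    call run_batch { input: batch = b }
  }
  output {
    Array[File] all_results = flatten(run_batch.results)
  }
}
```

The choice of batch size, i.e., the point at which a batch becomes worth the spin-up overhead, is exactly the judgment the workflow author has to make; the engine only sees the batches it is given.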