secondaryFiles where nameroot differs
5.2 years ago
Peter vH ▴ 130

I am trying to write a wrapper for kraken2 and am struggling to express the database as a File with secondaryFiles. The DB is a directory containing three files: hash.k2d, opts.k2d and taxo.k2d. Because these files share no nameroot, the typical secondaryFiles pattern, which assumes at least a shared nameroot between the files, does not work. However, the specification states that secondaryFiles can be an expression that "must return a filename string relative to the path to the primary File, a File or Directory object with either path or location and basename fields set, or an array consisting of strings or File or Directory objects."

I thus tried:

#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: CommandLineTool
id: kraken2
baseCommand:
  - kraken2
inputs:
  database:
    type: 
      - Directory
      - File
    label: "Kraken 2 DB"
    inputBinding:
      position: 0
      prefix: --db
      valueFrom: $(self.dirname)
    secondaryFiles: |
      ${ 
        let dirname = self.location.split('/').slice(0,-1).join('/');
        return [
          { class: "File", location: dirname + '/opts.k2d' },
          { class: "File", location: dirname + '/taxo.k2d' }
        ]
      }

but this yields the following cwltool error (cwltool version 1.0.20190831161204):

 cwltool kraken2.cwl  input1.yml 
INFO /home/pvh/.virtualenvs/cwltool/bin/cwltool 1.0.20190831161204
INFO Resolved 'kraken2.cwl' to 'file:///home/pvh/Documents/code/SANBI/pvh-forks/bio-cwl-tools/kraken2/kraken2.cwl'
ERROR Got workflow error
Traceback (most recent call last):
  File "/home/pvh/.virtualenvs/cwltool/lib/python3.7/site-packages/cwltool/executors.py", line 168, in run_jobs
    for job in jobiter:
  File "/home/pvh/.virtualenvs/cwltool/lib/python3.7/site-packages/cwltool/command_line_tool.py", line 430, in job
    builder = self._init_job(job_order, runtimeContext)
  File "/home/pvh/.virtualenvs/cwltool/lib/python3.7/site-packages/cwltool/process.py", line 718, in _init_job
    discover_secondaryFiles=getdefault(runtime_context.toplevel, False)))
  File "/home/pvh/.virtualenvs/cwltool/lib/python3.7/site-packages/cwltool/builder.py", line 276, in bind_input
    bindings.extend(self.bind_input(f, datum[f["name"]], lead_pos=lead_pos, tail_pos=f["name"], discover_secondaryFiles=discover_secondaryFiles))
  File "/home/pvh/.virtualenvs/cwltool/lib/python3.7/site-packages/cwltool/builder.py", line 251, in bind_input
    self.bind_input(schema, datum, lead_pos=lead_pos, tail_pos=tail_pos, discover_secondaryFiles=discover_secondaryFiles)
  File "/home/pvh/.virtualenvs/cwltool/lib/python3.7/site-packages/cwltool/builder.py", line 332, in bind_input
    sf_location = datum["location"][0:datum["location"].rindex("/")+1]+sfname
TypeError: can only concatenate str (not "dict") to str
ERROR Workflow error, try again with --debug for more information:
can only concatenate str (not "dict") to str

So I presume the CWL I have is incorrect. Is there a way to specify this structure of files?

And even if this can be done with some form of expression, the specification warns:

"To work on non-filename-preserving storage systems, portable tool descriptions should avoid constructing new values from location, but should construct relative references using basename or nameroot instead."

Yet since dirname is seemingly not available to the expression, I am forced to rely on location. I also do not know what a non-filename-preserving storage system (e.g. S3?) would look like from this perspective.

5.2 years ago

Hello Peter vH,

I think there are two separate challenges here:

1) describing a File with secondaryFiles that don't share anything in common name-wise

and

2) staging a File and its secondaries inside a named directory

For (1), one can provide the plain name of each desired secondary file using a CWL Expression. (If we put the plain name there without wrapping it in a CWL Expression, it is interpreted as an extension to be appended to the original filename.)
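
To make the difference concrete, here is a small sketch of how the forms behave. The `alignments` input and the file name reads.bam are hypothetical and not part of the kraken2 wrapper; the resolved names in the comments follow the secondaryFiles pattern rules in the CWL v1.0 specification. It assumes InlineJavascriptRequirement is declared, since `$("opts.k2d")` is a string literal rather than a plain parameter reference.

```yaml
inputs:
  alignments:                 # hypothetical input, e.g. reads.bam
    type: File
    secondaryFiles:
      - .bai                  # pattern: appended to the file name -> reads.bam.bai
      - ^.bai                 # pattern: each caret strips one extension first -> reads.bai
  database:                   # primary file, e.g. db/hash.k2d
    type: File
    secondaryFiles:
      - $("opts.k2d")         # expression: name relative to the primary file -> db/opts.k2d
      # A bare "opts.k2d" here would be treated as a pattern and
      # appended to the primary name, giving db/hash.k2dopts.k2d.
```

In other words, the expression form is what lets a secondary file have a name unrelated to the primary file's name.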

For (2), one can use a CWL Expression to provide the entire listing.

Here's my solution for both:

#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: CommandLineTool
baseCommand: kraken2

requirements:
  InlineJavascriptRequirement: {}
  InitialWorkDirRequirement:
    listing:
      - |
        $( {"class": "Directory",
            "basename": "kraken_db",
            "listing": [ inputs.database ]
           })

inputs:
  database:
    type: File
    label: "Kraken 2 DB"
    secondaryFiles:
      - $("opts.k2d")
      - $("taxo.k2d")

arguments:
  - prefix: --db
    valueFrom: "kraken_db"

outputs: []

Thanks for the clarification about the use of expressions in contrast to bare strings with secondaryFiles. I ended up with:

inputs:
  database:
    type: 
      - Directory
      - File
    label: "Kraken 2 DB"
    doc: "Either a File referring to the hash.k2d file in the DB, or a Directory referencing the entire DB directory"
    inputBinding:
      position: 0
      prefix: --db
      valueFrom: |
        ${ return (self.class == "File") ? self.dirname : self.path }
    secondaryFiles:
      - $("opts.k2d")
      - $("taxo.k2d")

which allows the database parameter to be either a File or a Directory. In the case of the File input, it looks like:

database:
    class: File
    path: db/hash.k2d

and in the case of the Directory it is:

database:
    class: Directory
    path: db
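
For the File case, the secondary files can presumably also be listed explicitly in the job file rather than left for the runner to discover. A minimal sketch, assuming the same db/ layout as above (a File object in a CWL input document may carry its own secondaryFiles list):

```yaml
database:
  class: File
  path: db/hash.k2d
  # Explicit listing of the two companion files that sit next to hash.k2d.
  secondaryFiles:
    - class: File
      path: db/opts.k2d
    - class: File
      path: db/taxo.k2d
```

This keeps the job file self-describing, at the cost of repeating names the tool description already declares.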