This must be answered somewhere but I can't find it. How do I process all files matching a pattern, e.g.
/path/to/rawdata/*.dat
I've seen examples where multiple inputs are explicitly listed, but none where a pattern is specified.
For others looking at this thread, this is how I solved the problem. There may be better ways.
Basically, I ran my single-file workflow as a sub-workflow in a step that scatters across the input files. I did not attempt to get CWL to scan directories for input files, so I build the input file as a separate step.
So, the input looks like:
reads1:
  - class: File
    format: "edam:format_1930"
    location: "../data/S1_R1.fastq.gz"
  - class: File
    format: "edam:format_1930"
    location: "../data/S2_R1.fastq.gz"
reads2:
  - class: File
    format: "edam:format_1930"
    path: "../data/S1_R2.fastq.gz"
  - class: File
    format: "edam:format_1930"
    path: "../data/S2_R2.fastq.gz"
$namespaces: { edam: http://edamontology.org/ }
$schemas: [ http://edamontology.org/EDAM_1.16.owl ]
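Since the input file is built as a separate step, here is a minimal sketch of one way to generate it with a small Python script instead of writing it by hand. It assumes the mate files follow the `_R1.fastq.gz` / `_R2.fastq.gz` naming used above; the function name `build_paired_job` and its parameters are made up for illustration. It writes absolute paths and emits JSON, which is also accepted as an input object.

```python
import json
from pathlib import Path

# Mirrors the namespace declarations used in the hand-written input above.
NAMESPACES = {"edam": "http://edamontology.org/"}
SCHEMAS = ["http://edamontology.org/EDAM_1.16.owl"]


def build_paired_job(data_dir, r1_suffix="_R1.fastq.gz", r2_suffix="_R2.fastq.gz"):
    """Return an input object with reads1/reads2 arrays in matching order.

    A dotproduct scatter pairs element i of reads1 with element i of reads2,
    so both arrays must have the same length and the same sample order.
    """
    data_dir = Path(data_dir)
    reads1, reads2 = [], []
    # Sort so the sample order is deterministic between runs.
    for r1 in sorted(data_dir.glob("*" + r1_suffix)):
        sample = r1.name[: -len(r1_suffix)]
        r2 = data_dir / (sample + r2_suffix)
        if not r2.is_file():
            raise FileNotFoundError(f"no mate file found for {r1}")
        for target, path in ((reads1, r1), (reads2, r2)):
            target.append(
                {
                    "class": "File",
                    "format": "edam:format_1930",
                    "path": str(path.resolve()),
                }
            )
    return {
        "reads1": reads1,
        "reads2": reads2,
        "$namespaces": NAMESPACES,
        "$schemas": SCHEMAS,
    }


def write_paired_job(data_dir, job_path):
    """Write the input object for data_dir to job_path as JSON."""
    with open(job_path, "w") as handle:
        json.dump(build_paired_job(data_dir), handle, indent=2)
```

Calling `write_paired_job("../data", "job.json")` would produce a file that can be passed to the runner in place of the hand-written input. Raising an error on a missing mate is a deliberate choice: a silently shorter reads2 array would misalign the dotproduct pairs.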
I then have a scatter workflow:
class: Workflow
cwlVersion: v1.0
requirements:
  - class: InlineJavascriptRequirement
  - class: ScatterFeatureRequirement
  - class: StepInputExpressionRequirement
  - class: SubworkflowFeatureRequirement
inputs:
  reads1: File[]
  reads2: File[]
  # I found that with toil some constant input files also need to be reproduced here
  adapters:
    type: File
    default:
      class: File
      path: /path/to/adapters/TruSeq3-PE.fa
      location: /path/to/adapters/TruSeq3-PE.fa
outputs:
  [all the workflow outputs here]
steps:
  all:
    run: single-file-pl.cwl
    scatter: [read1, read2]
    scatterMethod: dotproduct
    in:
      read1:
        source: reads1
      read2:
        source: reads2
      adapters:
        source: adapters
    out: [all workflow output here]
Hopefully, this helps someone.
Hello thomas.e,
For CWL implementations that consume YAML/JSON input objects, you'll need a separate File entry for each file.
Here's an example input object, assuming an input named raw_data of type File[] (also known as type: array, items: File):
raw_data:
  - class: File
    path: /path/to/rawdata/000.dat
  - class: File
    path: /path/to/rawdata/001.dat
I've made an issue to add a convenience feature to the reference implementation to make this easier: https://github.com/common-workflow-language/cwltool/issues/448
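Until something like that exists, one workaround is to expand the pattern yourself and generate the input object. The following is only a sketch, not a cwltool feature: the helper names `job_from_pattern` and `write_job` are hypothetical, and it writes JSON rather than YAML so that it needs nothing beyond the Python standard library.

```python
import glob
import json
import os


def job_from_pattern(pattern, input_name="raw_data"):
    """Expand a glob pattern into an input object holding a File[] array."""
    # Sort so the array order does not depend on directory listing order.
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise ValueError(f"pattern matched no files: {pattern}")
    return {
        input_name: [
            {"class": "File", "path": os.path.abspath(p)} for p in paths
        ]
    }


def write_job(pattern, job_path, input_name="raw_data"):
    """Write the input object for pattern to job_path as JSON."""
    with open(job_path, "w") as handle:
        json.dump(job_from_pattern(pattern, input_name), handle, indent=2)
```

For example, `write_job("/path/to/rawdata/*.dat", "job.json")` followed by `cwltool workflow.cwl job.json`, where workflow.cwl stands in for your own workflow with a raw_data input of type File[].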
I see two scenarios here:
In case 1, as I mentioned, the tool never receives *.dat as an argument. It actually gets the resolved glob, a list of file paths. The way to handle this in CWL is with a mini workflow: first you have a tool that receives a list of files as an input and creates semantically meaningful outputs, say dat_files, meta_files, other_files. You do that by specifying globs on the outputs of that tool. Next you connect dat_files to your tool, which then receives all file paths on its command line, precisely as if it had been invoked with a glob.
In the other case, the best course of action would be to stage the input files to the working directory and pass a glob as a string to the tool.
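To make the difference between the two cases concrete, here is a small Python sketch (not CWL) that shows what a child process actually receives. It assumes a POSIX shell is available; the helper names are invented for the demonstration. When the command goes through a shell, the pattern is expanded before the tool starts. When it is passed directly, the tool gets the literal string and has to expand it itself.

```python
import glob
import json
import os
import shlex
import subprocess
import sys

# Child program that only reports the arguments it was started with.
SHOW_ARGS = "import sys, json; print(json.dumps(sys.argv[1:]))"


def args_seen_by_tool(workdir, via_shell):
    """Run the child in workdir with '*.dat' and return the args it saw."""
    base = [sys.executable, "-c", SHOW_ARGS]
    if via_shell:
        # Case 1: the shell expands *.dat, so the tool sees file names.
        command = shlex.join(base) + " *.dat"
        result = subprocess.run(
            command, shell=True, cwd=workdir,
            capture_output=True, text=True, check=True,
        )
    else:
        # Case 2: no shell involved, so the tool sees the literal pattern.
        result = subprocess.run(
            base + ["*.dat"], cwd=workdir,
            capture_output=True, text=True, check=True,
        )
    return json.loads(result.stdout)


def expand_inside_tool(pattern, workdir):
    """What a case-2 tool does itself: expand the pattern it was given."""
    matches = glob.glob(os.path.join(workdir, pattern))
    return sorted(os.path.basename(m) for m in matches)
```

In the second case the tool can only find the files if they really are in its working directory, which is why staging the inputs there matters.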
Thanks, good suggestions. I'll look at option 2 - when (if) I get sufficiently proficient at CWL that even the smallest things don't take hours :)