Hello,
If I have a command line tool like this:
java -Xmx55g -Xms55g -jar /a/apps/picard/picard-tools-2.4.1/picard.jar ExtractIlluminaBarcodes \
INPUT_BASECALLS_DIR=/150313_D00282_0057_BC6FF5ANXX/Data/Intensities/BaseCalls \
INPUT_BARCODE_FILE=/barcodes/barcode1.txt \
OUTPUT_METRICS_FILE=/barcodes/barcode1.metrics
In CWL, should I code the above INPUTs and OUTPUT as type 'Directory' and 'File'? Or should I code those as type 'string'? What would be the difference in behavior of CWL during execution? What are the pros and cons?
I seem to be having better luck getting it to run in CWL with type 'string', as below. And even though 'OUTPUT_METRICS_FILE=' designates the name of the file to be written, I put it in the 'inputs' section as just another 'string' parameter, and that seems to work okay. Please help us think more clearly about this. Thanks!!
cwlVersion: v1.0
class: CommandLineTool
baseCommand: java
inputs:
  - id: basecalls_dir
    type: string
    inputBinding:
      position: 5
      separate: false
      prefix: "INPUT_BASECALLS_DIR="
  - id: barcode_file
    #type: File
    type: string
    inputBinding:
      position: 8
      separate: false
      prefix: "INPUT_BARCODE_FILE="
  - id: metrics_file
    type: string
    inputBinding:
      position: 10
      separate: false
      prefix: "OUTPUT_METRICS_FILE="
There are tool descriptions for some of the picard tools in the CWL repository:
https://github.com/common-workflow-language/workflows/tree/master/tools
Nothing with directory inputs though.
I agree with StarvingMarvin's answer in general: use Files and Directories for flexibility. Still, my experience with CWL Directories is mixed. As I understand it, when a Directory is initialized, the first thing the implementation does is walk it and all its subfolders to list every file. The basecall dir can be rather large, depending on your instrument. I'm not sure when CWL decides it is necessary to copy a Directory or File, but you probably don't want that to happen to the basecall directory, for the same reason.
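For comparison, here is a rough sketch of what the Directory/File version of the tool in the question could look like. It is not a tested description of ExtractIlluminaBarcodes, only the command from the question rewritten with typed inputs. The input name metrics_name and the output id metrics are made up for illustration, and I am assuming the metrics file is written by relative name into the working directory so that the glob can pick it up.

```yaml
cwlVersion: v1.0
class: CommandLineTool
# Fixed part of the command line, copied from the question.
baseCommand: [java, -Xmx55g, -Xms55g, -jar, /a/apps/picard/picard-tools-2.4.1/picard.jar, ExtractIlluminaBarcodes]
inputs:
  - id: basecalls_dir
    type: Directory        # the runner is told this path is an input it must make available
    inputBinding:
      separate: false
      prefix: "INPUT_BASECALLS_DIR="
  - id: barcode_file
    type: File
    inputBinding:
      separate: false
      prefix: "INPUT_BARCODE_FILE="
  - id: metrics_name       # hypothetical name; only the file NAME is an input
    type: string
    default: barcode1.metrics
    inputBinding:
      separate: false
      prefix: "OUTPUT_METRICS_FILE="
outputs:
  - id: metrics            # hypothetical id; declares the written file as a real output
    type: File
    outputBinding:
      glob: $(inputs.metrics_name)
```

The practical difference: with 'string' the runner passes the text through untouched and knows nothing about the files, so the paths must already be valid wherever the command runs. With 'Directory' and 'File' the runner can stage the inputs (for example into a container or onto another node) and can hand the metrics file to a later workflow step, at the cost of it having to inspect those inputs first.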
Thanks StarvingMarvin and karl.nordstrom for your replies.
Yes, when I run the CommandLineTool above with the inputs typed as Directory and File, cwl-runner appears to be running, judging by CPU and memory usage, but it produces neither output nor an error message even after a long while. When I run the tool with the inputs typed as 'string', everything works as if I were running the command directly in the shell.
The basecalls_dir is from Illumina whole genome sequencing, so it's very big, 250 GB, with lots of folder levels and files.
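For anyone trying to reproduce this, a job file for the Directory/File variant would look roughly like the sketch below. The input names match the tool description above, and the paths are the ones from the original command. With the 'string' variant, each value is simply the bare path as text.

```yaml
# Job file sketch for a tool whose inputs are typed Directory and File.
basecalls_dir:
  class: Directory
  path: /150313_D00282_0057_BC6FF5ANXX/Data/Intensities/BaseCalls
barcode_file:
  class: File
  path: /barcodes/barcode1.txt
# Still a plain string: it names a file that does not exist yet.
metrics_file: /barcodes/barcode1.metrics
```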