*Edit for future readers: I assume the problem here was that the tool I was running tried to contact the manufacturer's servers. It wouldn't continue working unless the connection was successful, but it gave no indication of what it was trying to do. The --preserve-entire-environment flag of cwltool fixed the problem because it enabled the software to communicate through the network.
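If the hang really was caused by the tool losing network-related environment variables (an assumption; the thread never confirms which variable mattered), a narrower alternative to preserving the whole environment might be to pass only the needed variable through the tool description. The sketch below uses CWL's EnvVarRequirement. The input name proxy_url and the choice of HTTPS_PROXY are hypothetical and only illustrate the mechanism.

```yaml
cwlVersion: v1.0
class: CommandLineTool
baseCommand: read_fast5_basecaller.py

requirements:
  EnvVarRequirement:
    envDef:
      # Assumption: the tool reaches the network through a proxy
      # configured by this variable; the variable that matters on a
      # given system may be a different one.
      HTTPS_PROXY: $(inputs.proxy_url)

inputs:
  proxy_url:
    # Hypothetical input, not part of the original tool description.
    # It has no inputBinding, so it never appears on the command line.
    type: string

outputs: []
```

Whether this is enough depends on what the basecaller actually needs from the environment, so --preserve-entire-environment remains the fix that was confirmed to work here.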
Original Questions:
I'm new to CWL and not very experienced in programming in general, so please bear with me.
I have created a CommandLineTool which is supposed to use Oxford Nanopore's albacore basecaller to generate .fastq files from .fast5 current data.
When using this CommandLineTool, it takes about five minutes to generate all output files in the output directory. However, the cwl-runner job does not end for several more hours. When I invoke the basecaller manually from the command line, using the same parameters I use in the CommandLineTool, it generates the exact same data and then finishes in about five minutes.
Could this be caused by a faulty outputs field? I am still having difficulties understanding how outputs in CWL work, and the fact that the tool takes multiple hours to terminate makes it difficult for me to trace where my mistake lies.
The outputs I am expecting are some report files ("configuration.cfg", "pipeline.log", "sequencing_summary.txt", "sequencing_telemetry.js") as well as a directory called "workspace" containing several subdirectories filled with .fastq files.
This is the code of the CommandLineTool:
cwlVersion: v1.0
class: CommandLineTool
baseCommand: read_fast5_basecaller.py
inputs:
  input_directory:
    label: |
      Folder of current data in .fast5 format.
    type: Directory
    inputBinding:
      prefix: --input
  worker_threads:
    label: |
      Number of CPU-Cores used for computation.
    type: int
    inputBinding:
      prefix: --worker_threads
  flowcell:
    label: |
      Type of flowcell used in experiment.
    type: string
    inputBinding:
      prefix: --flowcell
  kit:
    label: |
      Type of kit used in experiment.
    type: string
    inputBinding:
      prefix: --kit
  output_directory:
    label: |
      Folder where albacore saves results.
    type: string
    inputBinding:
      prefix: --save_path
outputs:
  sequences:
    type:
      type: array
      items: File
    outputBinding:
      glob: $(inputs.output_directory+"/workspace/pass/*.fasta")
  config:
    type: File
    outputBinding:
      glob: $(inputs.output_directory+"configuration.cfg")
  pipeline:
    type: File
    outputBinding:
      glob: $(inputs.output_directory+"pipeline.log")
  summary:
    type: File
    outputBinding:
      glob: $(inputs.output_directory+"sequencing_summary.txt")
  telemetry:
    type: File
    outputBinding:
      glob: $(inputs.output_directory+"sequencing_telemetry.txt")
Thanks in advance for any help/advice.
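For anyone trying to reproduce this: besides the tool description, cwltool needs a job file that supplies the five inputs. A minimal sketch follows, in which the file name and every value (path, thread count, flowcell and kit identifiers, output folder name) are placeholders rather than values taken from the post.

```yaml
# job.yml (hypothetical file name)
input_directory:
  class: Directory
  path: /path/to/fast5      # placeholder path to the .fast5 folder
worker_threads: 4           # placeholder thread count
flowcell: FLO-MIN106        # placeholder flowcell identifier
kit: SQK-LSK108             # placeholder kit identifier
output_directory: albacore_out   # placeholder folder name for --save_path
```

Because output_directory is a plain string handed to --save_path, a relative name like this one is resolved against the job's working directory, which is also where the output globs are evaluated.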
edited the post to make it shorter & more comprehensible
FYI: InlineJavascriptRequirement is not needed for plain CWL parameter references like $(runtime.outdir) or $(inputs.foo). See https://www.commonwl.org/v1.0/CommandLineTool.html#Parameter_references & http://www.commonwl.org/user_guide/06-params/
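To illustrate the distinction, the two output bindings below aim at the same file. The first uses a plain parameter reference with string interpolation. The second concatenates inside the parentheses, which makes it a JavaScript expression, so it needs InlineJavascriptRequirement. The output names are made up for this sketch, and the output_directory input is borrowed from the question only as an example.

```yaml
cwlVersion: v1.0
class: CommandLineTool
baseCommand: read_fast5_basecaller.py

requirements:
  # Only needed because of config_by_expression below.
  InlineJavascriptRequirement: {}

inputs:
  output_directory:
    type: string
    inputBinding:
      prefix: --save_path

outputs:
  config_by_reference:
    type: File
    outputBinding:
      # Plain parameter reference; the rest of the string is
      # appended by interpolation, no JavaScript involved.
      glob: $(inputs.output_directory)/configuration.cfg
  config_by_expression:
    type: File
    outputBinding:
      # The "+" operator makes this a JavaScript expression.
      glob: $(inputs.output_directory + "/configuration.cfg")
```

If every binding is written in the first style, the requirements block can be dropped entirely.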
Thanks for the heads up, I'll keep that in mind!
Can you change your answer to reflect this? Thanks!
Thank you so much for all the work you have put into this! Your implementation works for me as well. After some experimentation, it seems that --preserve-entire-environment is the crucial part. Even my early attempts with messed-up outputs finish if I set this flag (I get errors because of the faulty outputs field, but they finish). Not only did you get the basecaller to work with CWL, you also helped me a lot in understanding outputs. So, thanks again. I can start working on the next steps of the workflow.
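For readers who land here with the same outputs problem, an outputs section matching the files listed in the question might look like the fragment below. It is a sketch, not the answer's exact code. It assumes the report files are written directly into the --save_path folder and that the reads end up under workspace, and it returns that folder as a single Directory instead of globbing individual .fastq files.

```yaml
# Fragment: replaces the outputs section of the tool description.
outputs:
  config:
    type: File
    outputBinding:
      glob: $(inputs.output_directory)/configuration.cfg
  pipeline:
    type: File
    outputBinding:
      glob: $(inputs.output_directory)/pipeline.log
  summary:
    type: File
    outputBinding:
      glob: $(inputs.output_directory)/sequencing_summary.txt
  telemetry:
    type: File
    outputBinding:
      glob: $(inputs.output_directory)/sequencing_telemetry.js
  workspace:
    # Assumption: the whole workspace folder is wanted as one output.
    type: Directory
    outputBinding:
      glob: $(inputs.output_directory)/workspace
```

Compared with the version in the question, each glob has a path separator between the folder and the file name, and the telemetry file uses the .js extension named in the question.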
Happy to help! I think CWL will be an outstanding tool for bioinformatics, so I have had a lot of fun learning it. I'm glad to help when I can!
I'm having fun as well, but still struggle with lots of basic stuff. Do you have any recommendations regarding resources to learn CWL? I have worked my way through the CWL User Guide ( https://www.commonwl.org/user_guide/ ). I try consulting the CWL specification at https://www.commonwl.org/v1.0/ when I encounter problems. However, there is still a lot left to learn and I'd like some pointers on where else to look.
Sorry it took me a bit to get back to you! I didn't see the message. As far as cwltool goes, I have learned a ton on GitHub. The most typical search I enter is "cwltool bwa"; then, in the left-hand column, click on "Code" and just go through any examples you find that might relate to your question. This is all bioinformatics related too, so it can be useful!
Also, if you add:
under the inputs, it will run on all of the fast5 directories under your main fast5 dir, so the command could stop at this point:
[dkennetz@node albacore]$ cwltool --outdir /path/where/you/want/read_fast5_basecaller/output/ --preserve-entire-environment \
It will then run on every fast5 directory, so it can be one-and-done and you don't have to run cwltool 300x or something. While keeping this line in the code, you can also run on individual directories using the following:
[dkennetz@node albacore]$ cwltool --outdir /path/where/you/want/read_fast5_basecaller/output/ --preserve-entire-environment \
This will turn off the recursive input. I'm all done now.
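The snippet that followed "if you add:" did not survive in this copy of the thread, so the fragment below is only a guess at its general shape: a boolean input with a default of true, which CWL renders as a bare flag when true and leaves off the command line when false. The input name recursive and the --recursive prefix are assumptions, not taken from the post.

```yaml
# Fragment: would sit under the existing inputs of the tool description.
inputs:
  recursive:
    # Hypothetical input; both the name and the prefix are assumed.
    type: boolean
    default: true
    inputBinding:
      prefix: --recursive
```

With a definition like this, leaving the input out of the job uses the default, and setting it to false drops the flag so that only the given directory is processed.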
Neat! I'm sure this will be helpful in making the final workflow more convenient to use.