CommandLineTool doesn't finish
6.5 years ago
Tom ▴ 540

Edit for future readers: I assume the problem here was that the tool I was running tried to contact the manufacturer's servers. It would not continue unless the connection succeeded, and it gave no indication of what it was trying to do. The --preserve-entire-environment flag of cwltool fixed the problem because it enabled the software to communicate over the network.

Original question:

I'm new to CWL and not very experienced in programming in general, so please bear with me.

I have created a CommandLineTool which is supposed to use Oxford Nanopore's albacore basecaller to generate .fastq files from .fast5 current data.

When using this CommandLineTool, it takes about five minutes to generate all output files in the output directory. However, the cwl-runner job does not end for several more hours. When I invoke the basecaller manually from the command line, using the same parameters I use in the CommandLineTool, it generates the exact same data and then finishes in about 5 minutes.

Could this be caused by a faulty outputs field? I am still having difficulties understanding how outputs in CWL work. But the fact that the tool takes multiple hours to terminate makes it difficult for me to trace where my mistake lies.

The outputs I am expecting are some report files ("configuration.cfg", "pipeline.log", "sequencing_summary.txt", "sequencing_telemetry.js"), as well as a directory called "workspace" containing several subdirectories filled with .fastq files.

This is the code of the CommandLineTool:

cwlVersion: v1.0
class: CommandLineTool
baseCommand: read_fast5_basecaller.py

inputs:
  input_directory:
    label: |
      Folder of current data in .fast5 format.
    type: Directory
    inputBinding:
      prefix: --input
  worker_threads:
    label: |
      Number of CPU-Cores used for computation.
    type: int
    inputBinding:
      prefix: --worker_threads
  flowcell:
    label: |
      Type of flowcell used in experiment.
    type: string
    inputBinding:
      prefix: --flowcell
  kit:
    label: |
      Type of kit used in experiment.
    type: string
    inputBinding:
      prefix: --kit
  output_directory:
    label: |
      Folder where albacore saves results.
    type: string
    inputBinding:
      prefix: --save_path

outputs:
  sequences:
    type:
      type: array
      items: File
    outputBinding:
      glob: $(inputs.output_directory+"/workspace/pass/*.fasta")
  config:
    type: File
    outputBinding:
      glob: $(inputs.output_directory+"configuration.cfg")
  pipeline:
    type: File
    outputBinding:
      glob: $(inputs.output_directory+"pipeline.log")
  summary:
    type: File
    outputBinding:
      glob: $(inputs.output_directory+"sequencing_summary.txt")
  telemetry:
    type: File
    outputBinding:
      glob: $(inputs.output_directory+"sequencing_telemetry.txt")

Thanks in advance for any help/advice.

edited the post to make it shorter & more comprehensible

cwl • 2.8k views
6.5 years ago
drkennetz ▴ 560

I figured out a workaround for the hanging, at least on my end (I have albacore installed and have done some MinION runs, so I can test in-house :)). I will explain the changes I made so you can hopefully follow them. The code:

cwlVersion: v1.0
class: CommandLineTool

inputs:
  input_directory:
    type: Directory
    inputBinding:
      prefix: --input
  worker_threads:
    type: int
    inputBinding:
      prefix: --worker_threads
  flowcell:
    type: string
    inputBinding:
      prefix: --flowcell
  kit:
    type: string
    inputBinding:
      prefix: --kit
outputs:
  outdir:
    type: Directory
    outputBinding:
      glob: $(runtime.outdir)
baseCommand: read_fast5_basecaller.py
arguments:
 - valueFrom: $(runtime.outdir)
   prefix: --save_path

There are some significant differences here. I got rid of the output_directory input and replaced it with an entry under arguments (at the bottom) that passes $(runtime.outdir) to --save_path. This is a plain CWL parameter reference, so no InlineJavascriptRequirement is needed. Behind the scenes, cwltool creates a temporary directory and runs the job inside it. The tool sees that directory as its designated output directory, which can be referenced as $(runtime.outdir). You can see this in the log you posted above:

[job basecallerBioStar.cwl] /tmp/tmpBOQSQ7$ read_fast5_basecaller.py \

You can see that the job runs in the temporary directory /tmp/tmpBOQSQ7, which cwltool created. This is the directory that $(runtime.outdir) resolves to ($(runtime.tmpdir) is a separate scratch directory), and it is what I changed your save_path to, so read_fast5_basecaller.py now writes its results directly into it. I then used glob, which matches paths with Unix-style wildcard patterns, to collect everything in $(runtime.outdir) as a single Directory output. When the tool finishes successfully, cwltool places the collected outputs in the final output location, which by default is your current working directory. Now you may be thinking, "Well, I don't want all my read_fast5_basecaller.py data written to the directory I run the tool from, and I don't want to have to run cwltool from within the directory where I want the output." Fortunately, cwltool has a built-in option for that called --outdir. So an example of my command-line input is the following:

[dkennetz@node albacore]$ cwltool --outdir /path/where/you/want/read_fast5_basecaller/output/ --preserve-entire-environment \
>readFast5.cwl --flowcell FLO-MIN106 --kit SQK-LSK108 --worker_threads 6 --input_directory /path/to/fast5/1/
Resolved 'readFast5.cwl' to 'file:///users/dkennetz/albacore/readFast5.cwl'

[job readFast5.cwl] /tmp/tmpdhpob_gu$ read_fast5_basecaller.py \
    --save_path \
    /tmp/tmpdhpob_gu \
    --flowcell \
    FLO-MIN106 \
    --input \
    /tmp/tmpf7ewqea7/stgc28aa3d7-0f40-4620-9bfc-a895d8f55416/1 \
    --kit \
    SQK-LSK108 \
    --worker_threads \
    4
| 4000 of 4000|##############################################|100% Time: 0:09:32

The tool I wrote completed successfully and then exited immediately (no hanging).
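For readers who want the individual report files as separate outputs instead of a single Directory, a sketch along these lines may work. It assumes that albacore writes the report files named in the question directly into the --save_path directory and that the reads end up under workspace/pass with a .fastq extension. Neither was verified in this thread, so adjust the patterns to what your run actually produces. The output names (sequences, config and so on) are taken from the question.

```yaml
cwlVersion: v1.0
class: CommandLineTool
baseCommand: read_fast5_basecaller.py

inputs:
  input_directory:
    type: Directory
    inputBinding:
      prefix: --input
  worker_threads:
    type: int
    inputBinding:
      prefix: --worker_threads
  flowcell:
    type: string
    inputBinding:
      prefix: --flowcell
  kit:
    type: string
    inputBinding:
      prefix: --kit

arguments:
  # Let the basecaller write straight into the job's output directory.
  - prefix: --save_path
    valueFrom: $(runtime.outdir)

outputs:
  # glob patterns are resolved relative to the job's output directory,
  # so no directory prefix and no string concatenation are needed.
  sequences:
    type: File[]
    outputBinding:
      glob: workspace/pass/*.fastq   # ASSUMPTION: location and extension of the reads
  config:
    type: File
    outputBinding:
      glob: configuration.cfg
  pipeline:
    type: File
    outputBinding:
      glob: pipeline.log
  summary:
    type: File
    outputBinding:
      glob: sequencing_summary.txt
  telemetry:
    type: File
    outputBinding:
      glob: sequencing_telemetry.js
```

Because every pattern is a plain string, this version needs neither an output_directory input nor any expressions. A File output whose glob matches nothing makes the step fail, which is a quick way to find out whether a file name was guessed wrong.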

I hope it works the same for you when you try it tomorrow! This was very fun for me so feel free to contact me if you have any more questions. Dennis


FYI: InlineJavascriptRequirement is not needed for plain CWL parameter references like $(runtime.outdir) or $(inputs.foo).

See https://www.commonwl.org/v1.0/CommandLineTool.html#Parameter_references & http://www.commonwl.org/user_guide/06-params/
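A small sketch of the distinction, using echo as a stand-in tool. The input name sample is made up for illustration. The first two arguments are plain parameter references and would work without any requirement. The third is a JavaScript expression, and only that one makes InlineJavascriptRequirement necessary.

```yaml
cwlVersion: v1.0
class: CommandLineTool
baseCommand: echo

# Only required because of the last argument below.
requirements:
  - class: InlineJavascriptRequirement

inputs:
  sample:          # hypothetical input, for illustration only
    type: string

arguments:
  # Plain parameter references: no InlineJavascriptRequirement needed.
  - valueFrom: $(runtime.outdir)
  # A parameter reference interpolated into a longer string is still fine.
  - valueFrom: $(inputs.sample).txt
  # String concatenation with "+" is a JavaScript expression.
  - valueFrom: $(inputs.sample + ".txt")

outputs: []
```

The concatenated globs in the question, such as $(inputs.output_directory+"configuration.cfg"), fall into the last category.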


Thanks for the heads up, I'll keep that in mind!


Can you change your answer to reflect this? Thanks!


Thank you so much for all the work you have put into this! Your implementation works for me as well. After some experimentation it seems that --preserve-entire-environment is the crucial part. Even my early attempts with messed-up outputs finish if I set this flag (I get errors because of the faulty outputs field, but they finish).

Not only did you get the basecaller to work with CWL, you also helped me a lot in understanding outputs. So, thanks again. I can start working on the next steps of the workflow.
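A note for later readers: --preserve-entire-environment makes the run depend on whatever happens to be set in the calling shell. If it turns out that only one or two variables matter, CWL's EnvVarRequirement can set them explicitly in the tool description. The sketch below is purely illustrative. This thread never established which variable the basecaller needs, so https_proxy and the proxy_url input are assumptions, not findings.

```yaml
cwlVersion: v1.0
class: CommandLineTool
baseCommand: read_fast5_basecaller.py

requirements:
  - class: EnvVarRequirement
    envDef:
      # ASSUMPTION: the variable name is chosen for illustration only.
      - envName: https_proxy
        envValue: $(inputs.proxy_url)

inputs:
  proxy_url:       # hypothetical input, not part of the original tool
    type: string
  input_directory:
    type: Directory
    inputBinding:
      prefix: --input
  worker_threads:
    type: int
    inputBinding:
      prefix: --worker_threads
  flowcell:
    type: string
    inputBinding:
      prefix: --flowcell
  kit:
    type: string
    inputBinding:
      prefix: --kit

arguments:
  - prefix: --save_path
    valueFrom: $(runtime.outdir)

outputs:
  outdir:
    type: Directory
    outputBinding:
      glob: $(runtime.outdir)
```

cwltool also has a --preserve-environment option that passes a single named variable through, which can help narrow down which one the tool actually depends on.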


Happy to help! I think CWL will be an outstanding tool for bioinformatics, so I have had a lot of fun learning it. I'm glad to help when I can!


I'm having fun as well, but I still struggle with lots of basic stuff. Do you have any recommendations regarding resources to learn CWL? I have worked my way through the CWL User Guide ( https://www.commonwl.org/user_guide/ ). I try consulting the CWL specification at https://www.commonwl.org/v1.0/ when I encounter problems. However, there is still a lot left to learn and I'd like some pointers on where else to look.


Sorry it took me a bit to get back to you! I didn't see the message. As far as cwltool goes, I have learned a ton on GitHub. The search I enter most often is "cwltool bwa"; then I click on "code" in the left-hand column and go through any examples that might relate to my question. These are all bioinformatics-related too, so they can be useful!


Also, if you add:

inputs:
  recursive:
    type: boolean
    default: true
    inputBinding:
      prefix: --recursive

under inputs, it will run on all of the fast5 directories below your main fast5 directory, so the input path can stop at the top level:

[dkennetz@node albacore]$ cwltool --outdir /path/where/you/want/read_fast5_basecaller/output/ --preserve-entire-environment \
readFast5.cwl --flowcell FLO-MIN106 --kit SQK-LSK108 --worker_threads 6 --input_directory /path/to/fast5/

It will then run on every fast5 subdirectory in one go, so you don't have to run cwltool 300 times or so. While keeping this input in the code, you can also run on individual directories using the following:

[dkennetz@node albacore]$ cwltool --outdir /path/where/you/want/read_fast5_basecaller/output/ --preserve-entire-environment \
readFast5.cwl --flowcell FLO-MIN106 --kit SQK-LSK108 --worker_threads 6 --input_directory /path/to/fast5/1/ --recursive false

This will turn off the recursive input. I'm all done now.


Neat! I'm sure this will be helpful in making the final workflow more convenient to use.

6.5 years ago
drkennetz ▴ 560

For your outputs field, each glob should point to the location where you expect that output to appear. In your code, you are telling cwltool to look for all of these output files relative to a string input. The outputs field does not know that output_directory is an actual directory, because all you have specified is that it is a string (I know you are using it to name your directory, but cwltool does not know it is a directory). Each output should correspond to where that output comes from.

You should also pass the script as an input and use python as the baseCommand, rather than relying on cwltool to work out how to run read_fast5_basecaller.py. To do that, use the following in your inputs/baseCommand:

baseCommand: python
inputs:
  script:
    type: File
    inputBinding:
      position: 1
    default:
      class: File
      location: read_fast5_basecaller.py

Also change your output_directory input to reflect that it is a directory, not just a string.

output_directory:
  type: Directory
  inputBinding:
    prefix: --save_path

This tells cwltool exactly how to interpret the script it is running, rather than having it try to figure that out on its own. After this, point your glob to this directory for all expected outputs:

outputs:
  sequences:
    type: Directory ## this is the location that you specify as the save path for read_fast5_basecaller.
    outputBinding:
      glob: $(inputs.output_directory)

As an example, say you want an output that you know is generated from the input you've named flowcell (this is a hypothetical) and the output from input flowcell would be a type: File:

outputs:
  flowcell_out:
    type: File
    outputBinding:
      glob: $(inputs.flowcell.basename).txt

This tells cwltool to expect this specific output to come from your input labeled flowcell. This is how you use inputs as a pointer in a glob. I hope this fixes your problem! (I think the biggest issue was that you were telling cwltool to expect all outputs to come from a string, not a directory.) You may also have run into some issues because you weren't telling cwltool to expect to run an external script. I hope this is informative! Post again if you have any more questions; I will try to help to save Mr. C some time. If you need me to rewrite the entire script top to bottom, I don't mind doing it.

If you leave the outputs field as you have it, it will be redundant, but useful if you need any of those files as a pointer for downstream steps. If you include everything you currently have in outputs plus output_directory, cwltool will output everything in the output directory and point to each file as well as to the directory itself. If you only output output_directory, it will still output everything, but the final output object will only point to output_directory. At least I believe this is how it will break down.


Thank you very much for your very elaborate answer!

Regarding python and the script: read_fast5_basecaller.py seems to function only as an entry point for the whole basecalling process. I cannot get it to work using "python" as my base command in the way you described. Even manually invoking it from the terminal (for example using "python /usr/bin/read_fast5_basecaller.py --help") only makes the script present an exception message instead of the help text.

Unfortunately I have little knowledge of how exactly the script works. Since it's proprietary software made by the company selling the sequencers, I'm having difficulties gathering much information. I will try to get help with this from my colleagues tomorrow.

If I ignore outputs by setting the field to outputs: [], the cwl-runner still takes hours to complete after the 6-minute basecalling process has finished. So can this problem even be related to outputs?

Regarding outputs: I have modified outputs to expect a directory as you described:

outputs:
  sequences:
    type: Directory
    outputBinding:
      glob: $(inputs.output_directory)

The read_fast5_basecaller.py script requires a directory with fast5 files and some information about the experimental setup that was used to generate these fast5 files (in the form of --kit and --flowcell). It will then generate some reports and a folder full of sequence data in different fastq files. I'm having trouble with the concept of specific outputs being tied to specific inputs. Maybe this just isn't applicable to the specific tool I am using here? I could imagine it making more sense when using something like samtools.


No problem, I want to help you get this working. What type of environment are you running this on? Is it a Linux environment connected to a cluster, or a local desktop? If you have multiple versions of python installed, you could have one version loaded (say python/2.7.12) while read_fast5_basecaller.py is installed under a different version. Try this for your entire code and let me know how it works (we will forgo adding the script as an input, although that is good practice):

cwlVersion: v1.0
class: CommandLineTool
baseCommand: read_fast5_basecaller.py
inputs:
  input_directory:
    type: Directory
    inputBinding:
      prefix: --input
  worker_threads:
    type: int
    inputBinding:
      prefix: --worker_threads
  flowcell:
    type: string
    inputBinding:
      prefix: --flowcell
  kit:
    type: string
    inputBinding:
      prefix: --kit
  output_directory:
    type: Directory
    inputBinding:
      prefix: --save_path
outputs:
  all_outfiles:
    type: Directory
    outputBinding:
      glob: $(inputs.output_directory.basename)

If you want to see why the tool keeps running for 6 hours after it seems to have completed basecalling, run it with debug:

$ cwltool --debug input.cwl --flag1 --flag2

where the flags stand for all the inputs your actual tool requires. This prints all debug info to the screen, so it should tell you what the tool is doing while it seems to be hanging for those 6 hours. As far as specific outputs related to specific inputs go, you will not need to worry about this, because all of your outputs will come from the same place. It is more applicable to other tools (like samtools).


Unfortunately the code you provided seems to yield the same result.

Again, the tool seems stuck at this point:

[job basecallerBioStar.cwl] /tmp/tmpBOQSQ7$ read_fast5_basecaller.py \
    --flowcell \
    FLO-MIN107 \
    --input \
    /tmp/tmp6YtzwB/stgab4d9788-6039-49ee-abcb-2e3855014ff7/raw_fast5 \
    --kit \
    SQK-LSK308 \
    --save_path \
    /tmp/tmp6YtzwB/stgd3c41759-db87-41c4-a46a-43c9214d2b8a/outdir \
    --worker_threads \
    16
| 3765 of 3765|#########################################|100% Time: 0:24:48

The last line is the basecaller's output. I ran the cwl-runner with the --debug flag. Since it's 9:30 pm here, I will have to leave soon. I will let the terminal run until everything is finished and then post any additional output it provides tomorrow.


No problem at all, hopefully the debug info is informative as to why it is hanging for so long.


Thanks for the help! I'm currently working in a virtual machine hosted by our faculty. The VM is running an Ubuntu image which was specifically built for (manually) doing the steps of a basecalling/assembly/polishing workflow. My goal is to automate this exact workflow using CWL. Only python 2.7.12 should be present.

I have been using the --debug flag for my last few tries already, but there is no output after read_fast5_basecaller.py reports 100% completion.

I am running the code you posted right now and will report back when I have a result.
