I'm trying to set up a workflow with CWL and am struggling to figure out how to place the output files generated by the different steps in the workflow into their own directories. As it stands, all output files are put in the same directory, creating a lot of clutter. I would very much prefer to have different sub-directories for each step.
Right now I have something like this:
working directory
-- fastq
   -- read_1.fq
   -- read_2.fq
   ...
-- output # <-- all output files are dumped here
What I want is something like this:
working directory
-- fastq
   -- sample1_read_1.fq
   -- sample1_read_2.fq
   -- sample2_read_1.fq
   -- sample2_read_2.fq
   ...
-- trimmed
   -- sample1_read_1.trimmed.fq
   -- sample1_read_2.trimmed.fq
   -- sample2_read_1.trimmed.fq
   -- sample2_read_2.trimmed.fq
   ...
-- bam
   -- sample1.bam
   -- sample2.bam
   ...
-- qc
   -- QC reports
   ...
I can set up the individual command line tools to write their output to a directory, but the only way I have found to make that directory show up in the output, as indicated above, is to designate the whole directory as the output of that tool. While that would produce the directory structure I want, it introduces two problems. Firstly, it makes it harder to access individual files that are required for a subsequent step in the workflow. An obvious example is keeping the files for the first and second read properly separated. Secondly, I'm using Toil to run this workflow, and it doesn't support directories as inputs (strictly speaking this isn't a CWL issue, of course), which complicates things further.
This seems like something that should be easy. Am I missing something obvious here? Any advice on how to do this would be much appreciated.
Do you want to return the output of each step as a workflow output because 1) each intermediate result is a useful and important output in its own right, or 2) for troubleshooting/debugging purposes?
If the reason is 2), then that is a bit out of scope for the CWL language itself: all the needed information is available to the platform executing your CWL descriptions, and it is well placed to offer you options for preserving intermediate outputs and presenting them to you in a pleasing and useful way. In this case I would invite you to talk with the Toil team about the availability of such a feature. The reference implementation of CWL has a --leave-outputs option to "Leave output files in intermediate output directories", but that produces rather ugly paths at the moment.
If the reason is 1), then I direct you to the answer below.
Cheers,
Good point about debugging. That is certainly part of the reason, and I agree that extracting the intermediate files from the temporary directories is perfectly fine for that. There is the other aspect as well, but between your answer and Peter's that has been pretty well covered.
Just to illuminate the use case a bit, I'll add that with an example like the one above, I'd usually want to keep the QC reports and the BAM files. In addition, there is typically output such as gene expression estimates, variant calls or similar that also needs to be kept.
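For the case where only a few final results (say the QC reports and the BAM files) need to end up in named sub-directories, one pattern that may help is to collect the files into a Directory object as the very last step, so that all earlier steps keep passing individual File objects around. The following is only a minimal sketch, not a tested part of the workflow above: the input names files and dir_name and the output name out_dir are made up for illustration, and whether a given Toil version accepts a Directory as a workflow output would need to be checked.

```yaml
# Hypothetical helper tool: wraps a list of files in a named directory.
# The ids "files", "dir_name" and "out_dir" are illustrative assumptions.
cwlVersion: v1.0
class: ExpressionTool

requirements:
  InlineJavascriptRequirement: {}

inputs:
  files: File[]      # e.g. all BAM files produced by the alignment step
  dir_name: string   # e.g. "bam", "trimmed" or "qc"

outputs:
  out_dir: Directory

expression: |
  ${
    // Build a Directory object whose listing is the given files;
    // its basename becomes the sub-directory name in the final output.
    return {
      "out_dir": {
        "class": "Directory",
        "basename": inputs.dir_name,
        "listing": inputs.files
      }
    };
  }
```

In a workflow, a tool like this would be added as one extra step per sub-directory, fed from the File[] outputs of the step whose results should be kept, with out_dir exposed as a workflow output. Since no step would take a Directory as input, the individual read files remain separately accessible to downstream steps.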