With Nextflow you don't want to manage paths manually. Instead of hardcoding an output path, let each process write its results into whatever work directory Nextflow assigns, and capture them with an appropriate output declaration. Downstream processes should receive that output via <process_name>.out or the declared emit name. If you want the final results in a more convenient location, specify it with publishDir, preferably using the 'link' or 'symlink' mode (so you don't copy large files).
Here's an example. I define a FlyeTest process that assembles CLR reads using Flye. Flye writes the assembly to <dir>/assembly.fasta, where <dir> is set with the -o option; in my case that is asm_out/assembly.fasta, and I tell Nextflow to take this file as the output. publishDir will create a hard link to the FlyeTest output and place it in the results directory.
A plain fromPath channel would launch a separate FlyeTest task for each input file. With .collect() I pass all the files as a single list, and .join(' ') turns that list into a space-separated string of paths.
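I pass the value for the reads_path parameter as 'data/Cell-?/seq-??.fastq.gz'. Each ? matches exactly one character (any character, not only a digit), so with numeric names this covers up to 10 directories within data, Cell-0 to Cell-9, each with up to 100 fastq files from seq-00 to seq-99. Use [0-9] in place of ? if you want to match digits only.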
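To see which files such a pattern picks up, here is a small shell sketch. It is only an analogy: Nextflow resolves the pattern with its own glob matching rather than the shell's, but ? and [0-9] behave the same way in both. The directory and file names below are made up for the demonstration.

```shell
#!/bin/sh
set -eu

# Build a throwaway tree with hypothetical file names to match against.
workdir=$(mktemp -d)
mkdir -p "$workdir/data/Cell-1" "$workdir/data/Cell-A" "$workdir/data/Cell-10"
touch "$workdir/data/Cell-1/seq-01.fastq.gz" \
      "$workdir/data/Cell-A/seq-xy.fastq.gz" \
      "$workdir/data/Cell-10/seq-01.fastq.gz"
cd "$workdir"

# '?' matches any single character, so Cell-A/seq-xy is matched as well.
# Cell-10 is skipped because '?' stands for exactly one character.
set -- data/Cell-?/seq-??.fastq.gz
loose_count=$#
loose_matches=$*

# Bracket ranges restrict every position to a digit.
set -- data/Cell-[0-9]/seq-[0-9][0-9].fastq.gz
strict_count=$#
strict_matches=$*

echo "loose  ($loose_count): $loose_matches"
echo "strict ($strict_count): $strict_matches"
```

The looser pattern matches two of the three files, while the bracket version keeps only the all-digit one.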
process FlyeTest {
    // Hard-link the assembly into results/ instead of copying it
    publishDir 'results', mode: 'link'

    input:
    path read_list

    output:
    path 'asm_out/assembly.fasta'

    script:
    """
    flye --pacbio-raw ${read_list.join(' ')} -o asm_out --threads ${task.cpus}
    """
}

params.reads_path = './'

workflow {
    // collect() gathers every matched file into one list, so FlyeTest runs once
    read_ch = channel.fromPath(params.reads_path)
        .collect()
        .view()
    FlyeTest(read_ch)
}
$ nextflow run FlyeTest.nf --reads_path 'data/Cell-?/seq-??.fastq.gz'
To make things cleaner, I suggest setting params to sensible default values (adding checks where that's not possible) and writing a separate run.sh script that calls nextflow run <file_name>.nf --<param_name> <param_value> ...
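A minimal sketch of such a run.sh could look like the following. The PIPELINE and READS_PATH variable names and their defaults are my own choices for illustration, not anything Nextflow requires. The important detail is that the glob stays quoted, so the shell hands it to Nextflow unexpanded. To keep the sketch harmless on a machine without Nextflow, it only prints the command there instead of failing; a real wrapper would probably exit with an error in that case.

```shell
#!/bin/sh
set -eu

# Hypothetical defaults; override them from the environment if needed.
: "${PIPELINE:=FlyeTest.nf}"
: "${READS_PATH:=data/Cell-?/seq-??.fastq.gz}"

# Assemble the arguments once so the printed and executed commands are identical.
# Quoting "$READS_PATH" stops the shell from expanding the glob itself.
set -- run "$PIPELINE" --reads_path "$READS_PATH"

echo "nextflow $*"

if command -v nextflow >/dev/null 2>&1; then
    nextflow "$@"
else
    echo "nextflow not found on PATH; command was not executed" >&2
fi
```

Calling it as READS_PATH='other/dir/*.fastq.gz' ./run.sh would swap the input pattern without editing the script.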