How to give 2 input files (one from the previous process and one from 3 processes before) to a process in a Nextflow pipeline?
3 months ago

Hi everyone,

I am slowly learning Nextflow and I am trying to put together a very basic pipeline with a few processes (just a series of bash commands to manipulate a csv file). It worked until I added the last process (see below).

process ReplaceI5 {
  input:
  path 'test4.csv'
  path 'test1.csv'

  output:
  path 'test5.csv'

  script:
  """
  awk -v FS=, -v OFS=, 'FNR==NR{hash[FNR]=$0; next}{$8 = hash[FNR]}1' test4.csv test1.csv > test5.csv
  """
}

This process requires test4.csv, which comes from the previous process, and test1.csv, which is the output of the first process of the pipeline. If I run the pipeline like this, it gives me an error, and I think it is connected to the fact that maybe I cannot pass 2 files as input.

I am not familiar enough with Nextflow to understand exactly how to proceed or which options I have here. I was thinking of maybe redirecting test4 and test1 to the same directory when they are generated and then collecting them from that shared folder in this process.
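To make the structure clearer, the wiring I am aiming for looks roughly like this (all process names except ReplaceI5 are placeholders, not my real ones):

workflow {
  test1_ch = MakeTest1(params.input)   // first process, writes test1.csv
  test2_ch = MakeTest2(test1_ch)
  test3_ch = MakeTest3(test2_ch)
  test4_ch = MakeTest4(test3_ch)       // previous process, writes test4.csv

  ReplaceI5(test4_ch, test1_ch)        // needs both files
}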

Would appreciate some help/suggestions or examples you may have.

Best,

Giulia

nextflow nf-core processes channels • 1.0k views

If I run the pipeline like this, it gives me an error

show us the error

I cannot input 2 files as input.

wrong

input:
  tuple path(test4), path(test1) // do not quote the path
script:
  """
  awk -v FS=, -v OFS=, 'FNR==NR{hash[FNR]=\$0; next}{\$8 = hash[FNR]}1' '${test4}' '${test1}' > test5.csv
  """

Unfortunately, this has not worked either. I posted the error below.

Best,

Giulia


Thank you for your thorough reply dthorbur, you are right.

Here's the error

N E X T F L O W  ~  version 23.10.0
Launching main.nf [sharp_mandelbrot] DSL2 - revision: 3817c70cde
unknown recognition error type: groovyjarjarantlr4.v4.runtime.LexerNoViableAltException
ERROR ~ Script compilation error
  • file : /mnt/data10/users/giulia_trauzzi/nf_test/bash_test/main.nf
  • cause: token recognition error at: '0' @ line 61, column 50.
     -v OFS=, 'FNR==NR{hash[FNR]=$0; next}{$
                                 ^
1 error

This pipeline only had 1 process at the beginning: I followed one of the example pipelines downloaded from GitHub (process-collect.nf), but instead of using fq.gz files, I am using csv files. Then I slowly worked my way through the other processes.

I struggle to understand the example you reported, but I will read up on tuple and look into the documentation a bit more.

Thanks,

Giulia


'$' in awk must be escaped. See my code.


The tuple was just to show you how the logic is applied to other input types. The Nextflow learning curve can be pretty tough. Pierre has a point that I forgot to mention: the input declarations shouldn't be the file name, but a Nextflow variable that will be assigned the actual file when Nextflow stages the inputs in and writes the .command.sh file.

So if you use

input:
  path(file)

it should be referred to in the script block as ${file}. It so happens that in your example the hard-coded names are also the actual file names, so it works, but it is bad practice.
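For example, a minimal, self-contained sketch (process and file names are made up for illustration):

process CountLines {
  input:
  path csv                        // Nextflow assigns the staged input file to this variable

  output:
  path "line_count.txt"

  script:
  """
  # refer to the input via the variable, not the original file name
  wc -l ${csv} > line_count.txt
  """
}

Whatever file the channel provides will be staged into the work directory, and ${csv} will point at it.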

3 months ago

You are trying to hard-code your file names as channel names here.

I suggest this excellent introduction to Nextflow scripting, in particular the pages about channels and processes. If you have a little more time, you might also be interested in the nf-core foundational training.


This is also part of my learning process. However, as I am a real beginner in Nextflow, I currently struggle with pipelines/examples and the terms used in the documentation, and I learn better by putting things together first and then backing it all up with theory and reading (or doing both at the same time). Thanks for the links.

Giulia


Obviously, every learner is different and you know best what works for you.

However, I am positive you would benefit from reading about some core concepts of dataflow programming in general, and about Nextflow terminology, before trying to put something together. Knowing the difference between value and queue channels in Nextflow, or having understood that there is no concept of time in a dataflow programming language, will help you interpret and modify the examples you see.
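A quick illustration of that difference (file names are arbitrary):

ref_ch   = Channel.value('genome.fa')        // value channel: can be read any number of times
reads_ch = Channel.of('s1.csv', 's2.csv')    // queue channel: each item is consumed once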

The linked pages are part of an introduction to Nextflow that I can warmly recommend.


I agree, going through them now :)

Hopefully it will give me clarity! Thanks for your help.

Giulia

3 months ago
dthorbur ★ 2.5k

You can certainly have process inputs from multiple different previous processes, so that's not your problem.

I suspect this error is due to how Nextflow handles variable names, but you haven't included the error, so it's hard to tell. Nextflow first processes the script block with Groovy to handle things like input variables and params. For example:

process example {
  input: 
    tuple val(sample_id), path(pep_chunks)

  output:
    path "*_targetp.txt", emit: tp_out

  script:
  """
  part=`basename ${pep_chunks} | sed -r "s/(.*)(part_[0-9]+)(.*)/\\\\2/"`
  targetp -fasta ${pep_chunks}
  mv *.targetp2 ${sample_id}_\${part}_targetp.txt
  """
}

You have to escape variable names with \$var if they are to be evaluated as $var in the script block; otherwise Groovy thinks it's a variable it should know about, and fails when it doesn't. Note I created a variable called part in the script block, but called it later using \${part}. If you check the .command.sh file in the working directory, you can see how this is handled.

TL;DR, I think you need to replace $ with \$ in your awk command.
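Applied to your process, the awk line would become something like this (only the escaping changed):

  awk -v FS=, -v OFS=, 'FNR==NR{hash[FNR]=\$0; next}{\$8 = hash[FNR]}1' test4.csv test1.csv > test5.csv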

3 months ago

I find the docs mentioned above, and also the patterns site, really useful: https://nextflow-io.github.io/patterns/channel-duplication/

But for your example ReplaceI5:

  1. Add input channels, so you can deal with any number of input files, not just your two hard-coded filenames. Let's say you want to work on 1000 inputs, not just 1.
  2. Replace your file inputs with variables that come from the input channels (i.e. rename your input files).
  3. Make your output path dynamic, e.g. *result.csv.

Rough template code:

workflow {
  input_ch1 = Channel.fromPath("/data/*set1.csv")
  input_ch2 = Channel.fromPath("/data/*set2.csv")


  ReplaceI5(input_ch1, input_ch2)
}


process ReplaceI5 {

  input:
  path csv1
  path csv2

  output:
  path *result.csv

  script:
  """
  awk -v FS=, -v OFS=, 'FNR==NR{hash[FNR]=$0; next}{\$8 = hash[FNR]}1' $csv1 $csv2 > test_result.csv
  """
}
  • I'm not sure it would work; the input should be input_ch1.combine(input_ch2) (see the sketch below)
  • why a dynamic output name (which must be quoted) when there is only one file?
  • the awk variables must be escaped
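For instance, a rough sketch of the combine variant, based on the template above (the process then takes a single tuple input):

workflow {
  input_ch1 = Channel.fromPath("/data/*set1.csv")
  input_ch2 = Channel.fromPath("/data/*set2.csv")

  // combine pairs every set1 file with every set2 file (cartesian product)
  ReplaceI5( input_ch1.combine(input_ch2) )
}

process ReplaceI5 {
  input:
  tuple path(csv1), path(csv2)

  output:
  path "test_result.csv"

  script:
  """
  awk -v FS=, -v OFS=, 'FNR==NR{hash[FNR]=\$0; next}{\$8 = hash[FNR]}1' ${csv1} ${csv2} > test_result.csv
  """
}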

It has not worked for me; I got the same error message.

Giulia


@Pierre - I was more interested in demonstrating the use of multiple input channels and file variables in Nextflow than in solving an awk problem. You could combine the input channels, but this is basic learning, so why not start with an easy option. It all depends on what the OP is trying to solve.


Thank you both colindaven and dthorbur for your suggestions.

I understand the idea of assigning variables to the input files; I replaced them in the pipeline and tested it. It works!

I have a question that may sound silly. I have only replaced the input files with variables (path x) and then referred to them in the script as both of you mentioned, but I did not replace the output file. As I understand it, a Nextflow pipeline assumes sequentiality, taking the output from the first process and using it as the input of the following one. The input of the following process will be another variable (assigned by me, as it is good practice, e.g. y), which will automatically refer to the output of the first process. Would this be correct?
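In other words, something like this (process names are placeholders; First stands for my first process):

process Second {
  input:
  path y                        // y is whichever file comes down the input channel

  output:
  path 'out.csv'

  script:
  """
  head -n 5 ${y} > out.csv
  """
}

workflow {
  x_ch = First(params.input)    // output channel of the first process
  Second(x_ch)                  // passed on as the input of the next one
}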

Thanks,

Giulia

