Nextflow - I/O and index issues
2
0
Entering edit mode
10 months ago
bhumm ▴ 170

This is a crosspost from stack overflow (Nextflow - I/O and index issues). Based on this thread (how to handle exact duplicates...) crossposting is allowed on Biostars, but I will delete if necessary.


I am new to nextflow and am trying to build a pipeline. I am having trouble with passing outputs into new processes, properly capturing outputs, and with BWA-MEM2 index designation.

Here is my script:

#!/usr/bin/env nextflow
// Declare syntax version
nextflow.enable.dsl = 2

// Set parameters
params.fastq = "$projectDir/fastq/*_{1,2}.f*"
params.refgenome = "$projectDir/RefGenome/hg38.fa"

workflow {

    read_pairs_ch = Channel.fromFilePairs(params.fastq, checkIfExists: true) 

    trimAndQC(read_pairs_ch)
    readMapping(params.refgenome, trimAndQC.out.trimmedreads)
}

process trimAndQC {
    debug true
    tag "$sample_id"

    publishDir 'trimmed/', mode: 'copy', overwrite: false

    input:
    tuple val(sample_id), path(fastq)

    output:
    tuple val('sample_id'), path('*_val_1.fq.gz'), path('*_val_2.fq.gz'), emit: trimmedreads

    shell:
    """
    trim_galore --paired --fastqc ${fastq.get(0)} ${fastq.get(1)}
    """
}

// BWA read mapping
process readMapping {
    debug true
    tag "$sample_id"

    publishDir 'bwa_aligned/', mode: 'copy', overwrite: false

    input:
    path(refgenome)
    tuple val(sample_id), val(trimmedreads)

    output:
    path('${sample_id}_sorted.bam'), emit: sorted_bam

    script:
    def indexbase = refgenome.baseName

    """
    bwa-mem2 mem -t 2 '${indexbase}' '${trimmedreads}' | samtools sort -@2 -o '${sample_id}_sorted.bam' -
    """
}

First, I am getting this warning:

WARN: Input tuple does not match input set cardinality declared by process `readMapping` -- offending value: [sample_id, /Users/me/Desktop/fastq/work/27/448a89259a9435ff86d52009afa05d/SRR925811_1_val_1.fq.gz, /Users/me/Desktop/fastq/work/27/448a89259a9435ff86d52009afa05d/SRR925811_2_val_2.fq.gz]

Which is confusing as my output is a tuple and thus should match the cardinality of the output of trimAndQC. As an aside, I have noticed most scripts passing the input tuple (in this case path(fastq)) as a single argument, but to get the command to work I have to use the .get() accession method to 'unpack' the tuple, so I am unsure what I am doing wrong there.

Second, I am getting this error in regards to the output of readMapping process:

ERROR ~ Error executing process > 'readMapping (sample_id)'

Caused by:
  Missing output file(s) `${sample_id}_sorted.bam` expected by process `readMapping (sample_id)`

Command executed:

  bwa-mem2 mem -t 2 'hg38' '/Users/me/Desktop/fastq/work/27/448a89259a9435ff86d52009afa05d/SRR925811_1_val_1.fq.gz' | samtools sort -@2 -o sample_id_sorted.bam -

Command exit status:
  0

Command output:
  (empty)

I assume this may be due to the cardinality warning above and sample_id is empty.

Finally, I cannot seem to designate the index properly to the BWA-MEM2 command. The error:

Command error:
  Looking to launch executable "/Users/me/opt/anaconda3/envs/nextflow/bin/bwa-mem2.sse42", simd = .sse42
  Launching executable "/Users/me/opt/anaconda3/envs/nextflow/bin/bwa-mem2.sse42"
  -----------------------------
  Executing in SSE4.2 mode!!
  -----------------------------
  * SA compression enabled with xfactor: 8
  * Ref file: hg38
  * Entering FMI_search
  ERROR! Unable to open the file: hg38.bwt.2bit.64

Work dir:
  /Users/me/Desktop/fastq/work/23/f8b3b6eace6f11f8619ce435b16725

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

 -- Check '.nextflow.log' file for details

In my RefGenome directory I have hg38.fa indexed on an AWS EC2 and downloaded to locally to prototype the pipeline. The files look have these extensions: hg38.fa.{0123,amb,ann,pac,bwt.2bit.64} and appear to work fine if I just run the bwa-mem2 mem command in my terminal.

I have tried to adapt multiple SO answers (e.g., how to chain Channel.fromFilePairs from one process to another process, nextflow: how to pass directory path with files created in process before), I have gone through the EMBL training (Nextflow), looked through pipelines on nf-core and github, and have spent time with nextflow documentation prior to asking this question. I seem to be misunderstanding some inherent principle of nextflow so any advice on what to focus would be appreciated.

Finally, for clarity and reproducibility I am using the following fastqs from NCBI SRA (SRX317818) for prototyping. These are the versions of nextflow, trim-galore, and bwa-mem2 I am using all via an Anconda environment:

bwa-mem2                  2.2.1                hdf58011_5    bioconda
nextflow                  23.10.1              hdfd78af_0    bioconda
trim-galore               0.6.10               hdfd78af_0    bioconda
bwa-mem2 groovy dsl nextflow trim-galore • 2.7k views
ADD COMMENT
3
Entering edit mode
9 months ago
dthorbur ★ 2.5k

I can't see the stackoverflow post, the link leads me to a error message. Anyway, it's not important.

First problem:

The first issue is not keeping the same tuple structure. You have written the following, which is a tuple with 3 objects, tuple[id, r1, r2] as output from process trimAndQC

tuple val('sample_id'), path('*_val_1.fq.gz'), path('*_val_2.fq.gz'), emit: trimmedreads

then you declare the input of the mapping process as tuple[id, trimmed_reads], when it should be tuple[id, r1, r2]. In addition, you are using val(trimmedreads) when it should be path().

tuple val(sample_id), val(trimmedreads)

Second problem:

It looks to me like you haven't indexed your reference genome. You only need to do this once, and the index files should be loaded into the working environment in the same way you have with the reference genome. See docs for bwa index ref.fa here.

ADD COMMENT
0
Entering edit mode

dthorbur Thanks for your response. Firstly, thanks for letting me know about the SO link - I have updated and it should work now.

I have updated the input of the readMapping process as you suggested:

input:
path(index)
tuple val(sample_id), path(reads1), path(reads2)

It works great, thanks for catching that simple mistake!

As for the indexed reference genome, I believe it was indexed properly. As mentioned, I indexed on an EC2 instance as my laptop was hanging whenever I would try indexing locally. The files look like this:

user$ du hg38.fa.{0123,amb,ann,bwt.2bit.64,pac}

12538688    hg38.fa.0123
48          hg38.fa.amb
48          hg38.fa.ann
20371448    hg38.fa.bwt.2bit.64
1567040         hg38.fa.pac

They seem to be the right files and of a reasonable size.

Perhaps I will try on my EC2 and see if there was something lost during the compression and transfer of the files to my local machine. All that being said, when looking at other questions on Biostars or SO, I notice most have an indexing process within their pipeline. This makes passing the index to the alignment step trivial as they can just access the emitted variable from the index process. As you stated, indexing only needs to occur once, therefore I would like to skip this step in my pipeline by designating a 'pre' indexed reference. Therefore, what is considered best practice when passing an index? Adding a parameter designating the base?

params.index = "$projectDir/RefGenome/hg38"

or something like I attempted in my current script?

params.refgenome = "$projectDir/RefGenome/hg38.fa"
def indexbase = refgenome.baseName

Also, while I realize it is effectively the same process, I am using bwa-mem2 over bwa.

ADD REPLY
0
Entering edit mode

So I have updated my script a bit. Due to issues with BWA-MEM2 accessing its associated index, I opted to reindex my reference using standard bwa. Here is the command:

bwa index hg38.fa

and it proceeds fine ending in the following message:

[bwt_gen] Finished constructing BWT in 710 iterations.
[bwa_index] 1872.12 seconds elapse.
[bwa_index] Update BWT... 14.54 sec
[bwa_index] Pack forward-only FASTA... 10.85 sec
[bwa_index] Construct SA from BWT and Occ... 1028.07 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index hg38.fa
[main] Real time: 2951.084 sec; CPU: 2946.132 sec

I then ran my updated script:

#!/usr/bin/env nextflow

// Declare syntax version
nextflow.enable.dsl = 2

// Set parameters
params.fastq = "$projectDir/fastq/*_{1,2}.f*"
params.refgenome = "$projectDir/RefGenome/hg38.fa"
params.index = "/Users/me/Desktop/RefGenome/BWAIndex/*"

workflow {
    read_pairs_ch = Channel.fromFilePairs(params.fastq, checkIfExists: true) 

    trimAndQC(read_pairs_ch)
    readMapping(params.index, trimAndQC.out.trimmedreads)
}

process trimAndQC {
    debug true
    tag "$sample_id"

    publishDir 'trimmed/', mode: 'copy', overwrite: false

    input:
    tuple val(sample_id), path(fastq)

    output:
    tuple val('sample_id'), path('*_val_1.fq.gz'), path('*_val_2.fq.gz'), emit: trimmedreads

    shell:
    """
    trim_galore --paired --fastqc ${fastq.get(0)} ${fastq.get(1)}
    """
}

// BWA read mapping
process readMapping {
    debug true
    tag "$sample_id"

    publishDir 'bwa_aligned/', mode: 'copy', overwrite: false

    input:
    path(index)
    tuple val(sample_id), path(reads1), path(reads2)

    output:
    path('${sample_id}_sorted.bam'), emit: sorted_bam

    script:
    """
    bwa mem -t 2 ${index} ${reads1} ${reads2} | samtools sort -@2 -o '${sample_id}_sorted.bam' -
    """
}

and am met with a new error:

ERROR ~ Error executing process > 'readMapping (sample_id)'

Caused by:
  Process `readMapping (sample_id)` terminated with an error exit status (1)

Command executed:

  bwa mem -t 2 * SRR925811_1_val_1.fq.gz null | samtools sort -@2 -o 'sample_id_sorted.bam' -

Command exit status:
  1

Command output:
  (empty)

Command error:
  ln: target 'SRR925811_2_val_2.fq.gz': Not a directory

Work dir:
  /Users/me/Desktop/fastq/work/a4/ef78a4de86838a7acadef4cb3bb50b

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details

which is thoroughly confusing as the 'SRR925811_2_val_2.fq.gz' file is passed from the previous step via an emitted variable.

Further, if I cd into the relevant work directory and run the command:

bwa mem -t 2 ~/Desktop/RefGenome/BWAIndex/hg38.fa SRR925811_1_val_1.fq.gz SRR925811_2_val_2.fq.gz

bwa starts up just fine. I am thoroughly stumped as to this new error.

ADD REPLY
0
Entering edit mode

Looking a little closer at the error, I notice the command executed is:

bwa mem -t 2 * SRR925811_1_val_1.fq.gz null | samtools sort -@2 -o 'sample_id_sorted.bam' -

where the mate pair is null. I am unsure why this is happening as the variable is passed from the preceding process in the workflow. It appears it is an issue with the input tuple I am passing, but I cannot see my mistake.

ADD REPLY
1
Entering edit mode

A few things.

Firstly, you can use bwa-mem2 index reference.fasta to index, though I don't know if there is a difference to normal bwa index.

Secondly, instead of get, you can use def to get around tuples with multiple objects. But I don't know what is considered best practices.

process trimAndQC {
    ...
    input:
    tuple val(sample_id), path(fastq)

    output:
    tuple val('sample_id'), path('*_val_1.fq.gz'), path('*_val_2.fq.gz'), emit: trimmedreads

    script:
    def (read1, read2) = reads
    """
    trim_galore --paired --fastqc ${read1} ${reads2}
    """
}

Thirdly, I think one of the problems are that you are only giving a path for the index files, not a channel, and if you did make a channel you are creating a queue channel with params.index = "/Users/me/Desktop/RefGenome/BWAIndex/*".

I'm used to errors from DSL1, but queue channels and value channels are used very differently, and what you have done will iterate over each file in the index directory rather than loading them all. Also, you only need the index in the environment, only the reference genome fasta is called in the bwa-mem2 command. Also wildcard in a whole directory feels a little dangerous to me, but maybe that's just me.

Here is an update for your process

process readMapping {
    debug true
    tag "$sample_id"

    publishDir 'bwa_aligned/', mode: 'copy', overwrite: false

    input:
    path(reference)
    path(index)
    tuple val(sample_id), path(reads1), path(reads2)

    output:
    path("${sample_id}_sorted.bam"), emit: sorted_bam

    script:
    """
    bwa mem -t ${params.BWA_cpus} ${reference} ${reads1} ${reads2} | samtools sort -@2 -o "${sample_id}_sorted.bam" -
    """
}

And an updated workflow

workflow {
    read_pairs_ch = Channel.fromFilePairs(params.fastq, checkIfExists: true) 
    Channel
      .fromPath(params.refgenome)
      .ifEmpty { error "No fasta found in ${params.refgenome}" }
      .first()  
      .set { reference }
    Channel
      .fromPath(params.index)
      .ifEmpty { error "No files found in ${params.index}" }
      .collect()  // This is how to group all files into a single object.
      .set { ref_index }

    trimAndQC(read_pairs_ch)
    readMapping(reference, ref_index, trimAndQC.out.trimmedreads)
}

And here is how I used indexes for other processes to make value channels in the past.

ref_genome = file( params.RefGen, checkIfExists: true )
ref_dir    = ref_genome.getParent()
ref_name   = ref_genome.getBaseName()
ref_dict   = file( "${ref_dir}/${ref_name}.dict", checkIfExists: true )
ref_index  = file( "${ref_dir}/${ref_name}*.fai", checkIfExists: true ).first()

Notice the use of /first(), this will ensure the channel is a value channel though I think it's less importand in DSL2.

ADD REPLY
0
Entering edit mode

dthorbur Hi, thanks for the response.

BWA and BWA-MEM2 do create different index files, so unfortunately their respective indexes are not compatible (from what I can tell).

Thanks for the tip, I will definitely utilize this variable assignment in future scripts and to get around any tuple data structure issues.

Thanks for pointing that out - I was (and still am) unclear as to queue vs value channels and will review the documentation appropriately. In regard to this, should I be designating the channel prior to passing to the workflow? Such as:

reference = Channel.fromPath(params.refgenome)

I guess I am trying to understand what is different between setting the channel using the method above vs using the path() method within an input block within the process. 

I was originally passing the reference genome (params.refgenome) but would still be met with an index error, hence my use of creating a directory and globbing all the files within. This has seemed to work, but I agree with the potential side effects of wildcarding an entire directory.

I will give some of your updated processes a go and see if they resolve my issue. All this being said, I seem to have found a relative fix to my original input and my index issues. However my most recent problem is output from the trimAndQC process where the 'read2' mate pair file, deriving from the emitted trimmedreads variable is being interpreted as a directory (see above comment and the following snippet of the error).

ERROR ~ Error executing process > 'readMapping (sample_id)'

Caused by:
  Process `readMapping (sample_id)` terminated with an error exit status (1)

Command executed:

  bwa mem -t 2 * SRR925811_1_val_1.fq.gz SRR925811_2_val_2.fq.gz | samtools sort -@2 -o 'sample_id_sorted.bam' -

Command exit status:
  1

Command output:
  (empty)

Command error:
  ln: target 'SRR925811_2_val_2.fq.gz': Not a directory

Which is confusing as the 'read1' is interpreted properly and the output tuple appears to be defined correctly. Any insight you may have on this would be appreciated. Thanks again for your help and patience!

ADD REPLY
0
Entering edit mode

So an update. I basically made a script from the processes and workflow you provided and am still unfortunately getting an error.

First when passing the reference and the index to the readMapping process I get variable name collison:

ERROR ~ Error executing process > 'readMapping (1)'

Caused by:
  Process `readMapping` input file name collision -- There are multiple input files for each of the following file names: hg38.fa

After changing my param file patterns a bit (i.e., hg38* to just '*' to try and increase stringency) I was met with this unbound variable error:

ERROR ~ Error executing process > 'readMapping ($sample_id)'

Caused by:
  Process `readMapping ($sample_id)` terminated with an error exit status (1)

Command executed:

  bwa mem -t 2 hg38.fa SRR925811_1_val_1.fq.gz SRR925811_2_val_2.fq.gz | samtools sort -@2 -o "$sample_id_sorted.bam" -

Command exit status:
  1

Command output:
  (empty)

Command error:
  .command.sh: line 2: sample_id_sorted: unbound variable
  [E::bwa_idx_load_from_disk] fail to locate the index files

Which is confusing as I am passing the input tuple properly. To simplify troubleshooting, I removed the following command from the process:

samtools sort -@2 -o "${sample_id}_sorted.bam" -

So when I remove the reference and just past the index I get this error:

ERROR ~ Error executing process > 'readMapping ($sample_id)'

Caused by:
  Process `readMapping ($sample_id)` terminated with an error exit status (1)

Command executed:

  bwa mem -t 2 hg38.fa.ann hg38.fa.pac hg38.fa.sa hg38.fa.amb hg38.fa.bwt hg38.fa SRR925811_1_val_1.fq.gz SRR925811_2_val_2.fq.gz

Command exit status:
  1

Command output:
  (empty)

Where nextflow is being too literal with the reference passed to the command.

And when I remove the index and just pass the reference (which is in the same directory as the index files) I get this error:

ERROR ~ Error executing process > 'readMapping ($sample_id)'

Caused by:
  Process `readMapping ($sample_id)` terminated with an error exit status (1)

Command executed:

  bwa mem -t 2 hg38.fa.ann SRR925811_1_val_1.fq.gz SRR925811_2_val_2.fq.gz

Command exit status:
  1

Command output:
  (empty)

Command error:
  [E::bwa_idx_load_from_disk] fail to locate the index files

For full context here is the script I most recently ran:

#!/usr/bin/env nextflow

nextflow.enable.dsl = 2

params.fastq = "$projectDir/fastq/*_{1,2}.f*"
params.refgenome = "$projectDir/RefGenome/BWAIndex/hg38.fa"
params.index = "$projectDir/RefGenome/BWAIndex/*"

process trimAndQC {
    debug true
    tag "$sample_id"

    input:
    tuple val(sample_id), path(fastq)

    output:
    tuple val('$sample_id'), path('*_val_1.fq.gz'), path('*_val_2.fq.gz'), emit: trimmedreads

    script:
    def (read1, read2) = fastq
    """
    trim_galore --paired --fastqc ${read1} ${read2}
    """
}

process readMapping {
    debug true
    tag "$sample_id"

    publishDir 'bwa_aligned/', mode: 'copy', overwrite: false

    input:
    //path(reference)
    path(reference)
    path(index)
    tuple val(sample_id), path(reads1), path(reads2)

    output:
    path("${sample_id}.bam"), emit: sorted_bam

    script:
    """
    bwa mem -t 2 ${reference} ${reads1} ${reads2}
    """
}

workflow {
    read_pairs_ch = Channel.fromFilePairs(params.fastq, checkIfExists: true) 
    Channel
      .fromPath(params.refgenome)
      .ifEmpty { error "No fasta found in ${params.refgenome}" }
      .first()
      .set { reference }
    Channel
      .fromPath(params.index)
      .ifEmpty { error "No files found in ${params.index}" }
      .collect()  // This is how to group all files into a single object.
      .set { ref_index }

    trimAndQC(read_pairs_ch)
    readMapping(reference, ref_index, trimAndQC.out.trimmedreads)
}

Only other thing I can think of trying is your method of making indexes and value channels at the end of your last comment. I can report back once I have tried to adapt that to my machine.

ADD REPLY
0
Entering edit mode

The last error makes sense, but why is a .ann file being loaded. If you are using the script verbatim that you put at the bottom, that couldn't have happened.

And the index not being present is correct, it's in the original path, but it has not been loaded into the working directory. Nextflow creates stages and creates symlinks of files in channels, so if it doesn't know it needs to load a file, it won't and by default it will use files in the working directory. Not so much a problem when working on a local machine or some HPCs, but when you use cloud computing absolute paths become a problem.

Otherwise, I don't see why you should be getting errors with the script you loaded at the bottom.

Also, I'm a little concerned why the sample_id values are seemingly not working.

ADD REPLY
0
Entering edit mode

dthorbur I am using the script I posted verbatim so I am just as confused as you are.

To simplify things, I am testing the problematic process (i.e., readMapping) into an individual script to try to sort out the index issue. Also after looking at some other SO and Biostar posts and combined with your above index advice I am running the following script:

#!/usr/bin/env nextflow
nextflow.enable.dsl = 2

params.index_dir = "$projectDir/RefGenome/BWAIndex"
params.ref = "hg38.fa"
params.fastq = "$projectDir/fastq/trimmed/*_{1,2}*"

process align {
    debug true
    tag "${sample_id}"

    input:
    path index_dir
    val ref
    tuple val(sample_id), path(fastq)

    output:
    path "${sample_id}_sorted.bam'"

    script:
    //def (read1, read2) = fastq
    """
    bwa mem ${index_dir}/${ref} ${fastq} | samtools sort -@2 -o "${sample_id}_sorted.bam" -
    """
}

workflow {
    index_ch = Channel.fromPath(params.index_dir)
    ref_ch = Channel.of(params.ref)
    fastq_ch = Channel.fromFilePairs(params.fastq)

    align(index_ch, ref_ch, fastq_ch)
}

And while it appears to happily ingest the index and it does not throw an error, it does not 'run'. This is what my terminal looks like:

(nextflow) me $ nextflow align.nf
N E X T F L O W  ~  version 23.10.1
Launching `align.nf` [sleepy_bhaskara] DSL2 - revision: 7066526b82
[-        ] process > align -

and it just does nothing. So it is starting up, but doesn't seem to execute anything. Further, the work directory made by nextflow is empty so there is no log I can try to investigate. I am completely lost here and not sure what is wrong. I have also tried removing the samtools part of the command and the same thing happens. Any advice is appreciated.

ADD REPLY
0
Entering edit mode

After the last set of errors and out of pure frustration, I decided to create a new Anaconda environment with freshly downloaded software. Here is the yaml I used:

name: newenv
channels:
  - defaults
  - bioconda
  - conda-forge
dependencies:
  - nextflow
  - samtools
  - trim-galore
  - bwa
  - python

I then adapted your method of index designation. Here is the simplified script I ended up using:

#!/usr/bin/env nextflow
nextflow.enable.dsl = 2

params.index_dir = "$projectDir/RefGenome/BWAIndex"
params.ref = "hg38.fa"
params.fastq = "$projectDir/fastq/trimmed/*_{1,2}*"

process align {
    debug true
    tag "${sample_id}"

    input:
    path index_dir
    val ref
    tuple val(sample_id), path(fastq)

    output:
    path "${sample_id}_sorted.bam"

    script:
    //def (read1, read2) = fastq
    """
    bwa mem ${index_dir}/${ref} ${fastq} | samtools sort -@2 -o "${sample_id}_sorted.bam'"
    """
}

workflow {
    index_ch = Channel.fromPath(params.index_dir)
    ref_ch = Channel.of(params.ref)
    fastq_ch = Channel.fromFilePairs(params.fastq)

    align(index_ch, ref_ch, fastq_ch)
}

This worked and made the sorted bam file within the work directory. I am rerunning with the correction and then will test with the trim process and report update the post as necessary.

ADD REPLY
1
Entering edit mode
9 months ago
bhumm ▴ 170

Thanks to dthorbur for his help and patience with my problem.


I have finally completely resolved all my problems and want to document here in case anyone else runs into such issues in the future.

First problem was issues with cardinality. As dthorbur pointed out, my issue was not keeping the same tuple structure between processes. The output from the upstream process:

output:
tuple val('sample_id'), path('*_val_1.fq.gz'), path('*_val_2.fq.gz')

Whereas the input into the downstream process:

input:
tuple val(sample_id), val(trimmedreads)

Resulting in an easy fix:

input:
tuple val(sample_id), path(read1), path(read2)

Next issue I was having was ingesting the index for BWA into my working environment. I found the most straightforward and simple method was to designate the index directory and fasta file separately like this:

// Set paths for index directory and the reference fasta
params.index_dir = "$projectDir/RefGenome/BWAIndex"
params.ref = "hg38.fa"
------------------------
...
// Designate the index within the processes like this
script:
"""
bwa mem -t 2 ${index_dir}/${ref} ${read1} ${read2} | samtools sort -@2 -o "${sample_id}_sorted.bam" -
"""
------------------------

// Define the channels and pass them to the process
index_ch = Channel.fromPath(params.index_dir)
ref_ch = Channel.of(params.ref)
...
bwa_align(index_ch, ref_ch, trim_QC.out.trimmed_reads)

My final problem was getting the bwa_align process to run properly. My original error was related to no output file, which I realized was an issue with single quotes around the output file and within the script block:

output:
path('${sample_id}_sorted.bam')

which Nextflow wasn't interpreting properly. Once I switched to double quotes all was fine. So I am tempted to strictly use double quotes when writing Nextflow scripts. My final issue as evidenced by the broad range of error messages documented above was solved by creating a new Anaconda environment and reinstalling all the dependencies. A very unsatisfying answer and does not get to the root of the problem, but it helped me nonetheless.

Here is my final script that finally worked:

#!/usr/bin/env nextflow
nextflow.enable.dsl = 2

// Define parameters
params.fastq = "$projectDir/fastq/*_{1,2}*"
params.index_dir = "$projectDir/RefGenome/BWAIndex"
params.ref = "hg38.fa"

workflow {

    fastq_ch = Channel.fromFilePairs(params.fastq)
    index_ch = Channel.fromPath(params.index_dir)
    ref_ch = Channel.of(params.ref)

    trim_QC(fastq_ch)
    bwa_align(index_ch, ref_ch, trim_QC.out.trimmed_reads)
}

process trim_QC {
    debug true
    tag "$sample_id"

    publishDir 'trimmed/', mode: 'copy', overwrite: false

    input:
    tuple val(sample_id), path(fastq)

    output:
    tuple val("${sample_id}"), path("*_val_1.fq.gz"), path("*_val_2.fq.gz"), emit: trimmed_reads

    script:
    def (read1, read2) = fastq
    """
    trim_galore --paired --fastqc ${read1} ${read2}
    """
}

process bwa_align {
    debug true
    tag "$sample_id"

    publishDir 'bwa_aligned/', mode: 'copy', overwrite: false

    input:
    path index_dir
    val ref
    tuple val(sample_id), path(read1), path(read2)

    output:
    path("${sample_id}_sorted.bam"), emit: sorted_bam

    script:
    """
    bwa mem -t 2 ${index_dir}/${ref} ${read1} ${read2} | samtools sort -@2 -o "${sample_id}_sorted.bam" -
    """
}
ADD COMMENT

Login before adding your answer.

Traffic: 1528 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6