Hello,
I'm trying to concatenate fastq files using Nextflow, but it doesn't work the way I want it to. I found this post (Merge fastq files) and basically copied its code.
nextflow.enable.dsl=2

def getLibraryId( prefix ){
    // fastqfile = ABC-S16_L001_R1_001.fastq.gz, ABC-S16_L002_R1_001.fastq.gz
    prefix.split("_")[0]
}

//params.raw_data_dir = "rawdata/"

// Gather the pairs of R1/R2 according to sample ID
Channel
    .fromFilePairs(params.rawdata + '/*_R{1,2}*.fastq.gz', flat: true, checkExists: true)
    .map { prefix, R1, R2 -> tuple(getLibraryId(prefix), R1, R2) }
    .groupTuple()
    .set{ files_channel }

process merge_lane {
    debug true
    tag "merging ${sample}"
    cpus 2
    memory '2 GB'
    time '2h'
    publishDir "${launchDir}/analysis/merge_lane", mode : "copy"

    input:
    tuple val(sample), path(R1), path(R2)

    output:
    path("${sample}_R1.fastq.gz")
    path("${sample}_R2.fastq.gz")

    script:
    """
    cat ${ R1.collect{ it }.sort().join(" ") } > ${sample}_R1.fastq.gz
    cat ${ R2.collect{ it }.sort().join(" ") } > ${sample}_R2.fastq.gz
    """
}
Nextflow generated a .command.sh for each sample, and I noticed that some of them didn't look right.
This is what I wanted to do:
cat Sample_L001_R1_001.fastq.gz Sample_L002_R1_001.fastq.gz > Sample_R1.fastq.gz
For example, this one came out as expected:
#!/bin/bash -ue
cat 6305-No_E_S23_L001_R1_001.fastq.gz 6305-No_E_S23_L002_R1_001.fastq.gz > 6305_R1.fastq.gz
cat 6305-No_E_S23_L001_R2_001.fastq.gz 6305-No_E_S23_L002_R2_001.fastq.gz > 6305_R2.fastq.gz
But for some reason, as you can see in the script below, Nextflow/Groovy didn't sort the R1 fastq files by name (L002 comes before L001):
#!/bin/bash -ue
cat 6298-No_E_S16_L002_R1_001.fastq.gz 6298-No_E_S16_L001_R1_001.fastq.gz > 6298_R1.fastq.gz
cat 6298-No_E_S16_L001_R2_001.fastq.gz 6298-No_E_S16_L002_R2_001.fastq.gz > 6298_R2.fastq.gz
Could you advise me on how to prevent this in Nextflow?
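One way to sidestep the ordering question is to let the shell sort the file names inside the task, rather than relying on the order Groovy hands over. Below is a minimal sketch, not a confirmed fix for the behaviour above. It assumes the lane files follow the `_L00N_R1_` naming shown in the .command.sh examples and that file names contain no whitespace. `merge_sorted` is a hypothetical helper name, not something Nextflow provides.

```shell
#!/bin/bash
# merge_sorted OUT FILE...
# Concatenate the given FILEs into OUT, ordered by file name.
# Assumption: file names contain no spaces or newlines.
merge_sorted() {
    local out=$1
    shift
    # LC_ALL=C gives a byte-wise sort that does not depend on the user's locale,
    # so ..._L001_... always comes before ..._L002_...
    printf '%s\n' "$@" | LC_ALL=C sort | xargs cat > "$out"
}
```

Called as `merge_sorted 6298_R1.fastq.gz 6298-No_E_S16_L002_R1_001.fastq.gz 6298-No_E_S16_L001_R1_001.fastq.gz`, this should write the L001 file first and the L002 file second, whatever order the arguments arrive in. If the function is pasted into a Nextflow `script:` block, the shell `$` characters (`$1`, `$@`, `$out`) would need to be escaped as `\$` so that Groovy does not try to interpolate them.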
Thanks for sharing the code! I will be trying that.