Facing issue with output of nextflow pipeline
5 days ago
harsh ▴ 20

I created a Nextflow pipeline to run RNA-seq preprocessing.

This is the error I am facing. Can anyone please help me resolve it?

ERROR ~ Error executing process > 'FastQC (1)'

Caused by:

      Missing output file(s) `*` expected by process `FastQC (1)` (note: input files are not included in the default matching set)

Command executed:

  mkdir -p /home/PDX_Data/data/output/fastqc
  fastqc --threads 12 -o /home/PDX_Data/data/output/fastqc ERR1084768_1.fastq.gz ERR1084768_2.fastq.gz 2> /home/PDX_Data/data/output/fastqc/error.log

Command exit status:
  0

Command output:
  application/gzip
  application/gzip
  Analysis complete for ERR1084768_1.fastq.gz
  Analysis complete for ERR1084768_2.fastq.gz

Work dir:
  /home/PDX_Data/data/work/a3/0ad1935f61de5cc612bae18d56a242
output nextflow fastqc issue • 525 views
nextflow.enable.dsl=2

// Define parameters directly
params.reads     = '/home/PDX_Data/data/*_{1,2}.fastq.gz'
params.adapters  = '/home/miniconda3/share/trimmomatic-0.39-2/adapters/NexteraPE-PE.fa'
params.index     = '/home/ref/grch38/genome'
params.gtf       = '/home/ref/Homo_sapiens.GRCh38.113.gtf'
params.output    = '/home/PDX_Data/data/output'
params.threads   = 12

// Ensure output directories exist
process SetupDirectories {
    output:
    path params.output

    script:
    """
    mkdir -p ${params.output}/{fastqc,trimmed,hisat2,bam,counts}
    """
}

// Quality control process
process FastQC {
    input:
    tuple val(sample_id), path(reads)

    output:
    path "*"

    script:
    """
    mkdir -p ${params.output}/fastqc
    fastqc --threads ${params.threads} -o ${params.output}/fastqc ${reads} 2> ${params.output}/fastqc/error.log
    """
}

// Read trimming
process Trimmomatic {
    input:
    path reads

    output:
    path "*_paired.fq.gz"

    script:
    """
    trimmomatic PE -threads ${params.threads} \
        ${reads[0]} ${reads[1]} \
        ${params.output}/trimmed/paired_1.fq.gz ${params.output}/trimmed/unpaired_1.fq.gz \
        ${params.output}/trimmed/paired_2.fq.gz ${params.output}/trimmed/unpaired_2.fq.gz \
        ILLUMINACLIP:${params.adapters}:2:30:10 SLIDINGWINDOW:4:20 MINLEN:36
    """
}

// Alignment with HISAT2
process HISAT2 {
    input:
    path trimmed_reads

    output:
    path "*.sam"

    script:
    """
    hisat2 -p ${params.threads} -x ${params.index} -1 ${trimmed_reads[0]} -2 ${trimmed_reads[1]} -S ${params.output}/hisat2/output.sam
    """
}

// Convert and sort SAM to BAM
process SamtoolsSort {
    input:
    path sam_files

    output:
    path "*.bam"

    script:
    """
    samtools view -@ ${params.threads} -bS ${sam_files} | samtools sort -@ ${params.threads} -o ${params.output}/bam/output.sorted.bam
    """
}

// Feature counting
process FeatureCounts {
    input:
    path sorted_bam

    output:
    path "featureCounts.txt"

    script:
    """
    featureCounts -T ${params.threads} -a ${params.gtf} -o ${params.output}/counts/featureCounts.txt -p -B -C ${sorted_bam}
    """
}

workflow {
    SetupDirectories 
    reads_ch = Channel.fromFilePairs(params.reads, suffix: '_1.fastq.gz')
    reads_ch.view()
    reads_ch | FastQC | Trimmomatic | HISAT2 | SamtoolsSort | FeatureCounts
}
5 days ago

FastQC is not creating any files that match your output pattern in the place where Nextflow looks for them, hence the "missing output files" error.

Debugging

  • check the FastQC work directory to see which files are actually being created there
  • try setting the expected output pattern to *.html
(base) user@user-ProLiant-DL380-Gen9:~/PDX_Data/data/output/fastqc$ ls

ERR1084768_1_fastqc.html  ERR1084768_1_fastqc.zip  ERR1084768_2_fastqc.html  ERR1084768_2_fastqc.zip  error.log

The results are created, but Nextflow is not able to read them. I think Nextflow is looking for them in the work directory, while the results are in the fastqc subdirectory under the output directory. But I don't know how to resolve this issue.

Please run tree -h work (assuming your Nextflow work directory is called work) and check what it shows.

Oh, best practice is not this:

fastqc --threads ${params.threads} -o ${params.output}/fastqc ${reads} 2> ${params.output}/fastqc/error.log

but this:

fastqc --threads ${params.threads} -o fastqc ${reads} 

That is, don't tell Nextflow where to create data. It manages all files inside the work directories itself. When a process completes, Nextflow publishes the declared outputs to the output directory, if one is set. If you write or move data to the output directory yourself, Nextflow will not be able to find it and use it as input for the next step. Note that FastQC does not create the -o directory itself, so either create it inside the script first (mkdir -p fastqc) and match fastqc/* in the output declaration, or write to the current directory with -o . instead.
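Applied to the FastQC step from the question, that advice might look like the sketch below. It is only a minimal sketch under a few assumptions: the params values are copied from the question, the reports are written into the task's work directory with -o . (so no directory has to be created), and publishDir with mode 'copy' is my own choice for getting the small report files into the output folder.

```nextflow
nextflow.enable.dsl=2

// Values copied from the question; adjust to your own layout
params.reads   = '/home/PDX_Data/data/*_{1,2}.fastq.gz'
params.output  = '/home/PDX_Data/data/output'
params.threads = 12

process FastQC {
    // Assumption: copy the finished reports here once the task has succeeded
    publishDir "${params.output}/fastqc", mode: 'copy'
    cpus params.threads

    input:
    tuple val(sample_id), path(reads)

    output:
    // Patterns are resolved relative to the task work directory
    path "*_fastqc.html", emit: html
    path "*_fastqc.zip",  emit: zip

    script:
    """
    fastqc --threads ${task.cpus} -o . ${reads}
    """
}

workflow {
    // Emits [ sample_id, [ read_1, read_2 ] ] for each pair
    reads_ch = Channel.fromFilePairs(params.reads)
    FastQC(reads_ch)
}
```

Because the reports now land in the work directory, the output patterns match them, and the published copies appear under the fastqc subdirectory of params.output without the script ever writing there directly.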

3 days ago
mmhryc • 0

With Nextflow you don't want to manage paths manually. Instead of writing to a hardcoded path, let each process write its results into whatever work directory Nextflow assigns, and capture them with an appropriate output declaration. Downstream processes should receive that output via <process_name>.out or via the declared emit name. If you want the final results in a more convenient location, then specify it using publishDir, preferably with the 'link' or 'symlink' mode (so you don't copy large files).

Here's an example. I define a FlyeTest process that assembles CLR reads using Flye. Flye writes the assembly to <dir>/assembly.fasta, where <dir> is set with the -o option; in my case that is asm_out/assembly.fasta, and I tell Nextflow to take this file as the output. publishDir will create a hard link to the FlyeTest output and place it in the results directory.

A plain fromPath channel would launch a separate FlyeTest task for each input file; with .collect() I pass them all as one list and .join(' ') them into a space-separated string of paths.

I pass the value for the reads_path parameter as 'data/Cell-?/seq-??.fastq.gz'. Each ? matches any single character (not only a digit), so with numeric names this covers up to 10 directories within data, Cell-0 to Cell-9, each with up to 100 FASTQ files from seq-00 to seq-99.

process FlyeTest {

    publishDir 'results', mode: 'link'

    input:
    path read_list

    output:
    path 'asm_out/assembly.fasta'

    script:
    """
    flye --pacbio-raw ${read_list.join(' ')} -o asm_out --threads ${task.cpus}
    """
}

params.reads_path = './'
workflow {
    read_ch = channel.fromPath(params.reads_path)
        .collect()
        .view()
    FlyeTest(read_ch)
}


$ nextflow run FlyeTest.nf --reads_path 'data/Cell-?/seq-??.fastq.gz'
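Carried back to the pipeline in the question, the same ideas (outputs relative to the work directory, named emit channels, and publishDir for the final location) could be sketched as follows for the trimming and alignment steps. This is an illustration rather than a tested pipeline: the params values and tool options are copied from the question, while the per-sample file names and the emit names paired and sam are hypothetical choices of mine.

```nextflow
nextflow.enable.dsl=2

// Values copied from the question
params.reads    = '/home/PDX_Data/data/*_{1,2}.fastq.gz'
params.adapters = '/home/miniconda3/share/trimmomatic-0.39-2/adapters/NexteraPE-PE.fa'
params.index    = '/home/ref/grch38/genome'
params.output   = '/home/PDX_Data/data/output'
params.threads  = 12

process Trimmomatic {
    cpus params.threads

    input:
    tuple val(sample_id), path(reads)

    output:
    // Only the paired reads are passed on; names are relative to the work dir
    tuple val(sample_id), path("${sample_id}_1_paired.fq.gz"), path("${sample_id}_2_paired.fq.gz"), emit: paired

    script:
    """
    trimmomatic PE -threads ${task.cpus} \\
        ${reads[0]} ${reads[1]} \\
        ${sample_id}_1_paired.fq.gz ${sample_id}_1_unpaired.fq.gz \\
        ${sample_id}_2_paired.fq.gz ${sample_id}_2_unpaired.fq.gz \\
        ILLUMINACLIP:${params.adapters}:2:30:10 SLIDINGWINDOW:4:20 MINLEN:36
    """
}

process HISAT2 {
    // Symlink instead of copying, since SAM files are large
    publishDir "${params.output}/hisat2", mode: 'symlink'
    cpus params.threads

    input:
    tuple val(sample_id), path(r1), path(r2)

    output:
    tuple val(sample_id), path("${sample_id}.sam"), emit: sam

    script:
    """
    hisat2 -p ${task.cpus} -x ${params.index} -1 ${r1} -2 ${r2} -S ${sample_id}.sam
    """
}

workflow {
    reads_ch = Channel.fromFilePairs(params.reads)
    Trimmomatic(reads_ch)
    // Downstream input comes from the named output channel, not from a path
    HISAT2(Trimmomatic.out.paired)
}
```

Keeping sample_id in every tuple means each sample gets its own output file names, so several read pairs can run through the pipeline without overwriting one another.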

To keep things clean, I suggest giving params sensible default values (or, where no sensible default exists, adding checks) and writing a separate run.sh script containing nextflow run <file_name>.nf --<param_name> <param_value> ...
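A parameter check of that kind might look like the sketch below. It assumes that reads_path has no sensible default and must therefore be supplied on the command line; the error message text is my own.

```nextflow
nextflow.enable.dsl=2

// No sensible default exists, so require the user to set it explicitly
params.reads_path = null

workflow {
    if (!params.reads_path) {
        error "Please provide --reads_path, e.g. --reads_path 'data/Cell-?/seq-??.fastq.gz'"
    }

    // checkIfExists makes the run fail early if the given path does not exist
    read_ch = Channel.fromPath(params.reads_path, checkIfExists: true)
        .collect()
    read_ch.view()
}
```

Failing at startup with a clear message is usually easier to diagnose than a process that later runs on an empty or unexpected input.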
