Question

Missing Output Exception Error in Snakemake with output directory

9 months ago

diop.awa94 • 0

I'm trying to run a workflow in Snakemake. I need to automate several steps, all of which rely on existing Python scripts or pipelines. My rule Gene_flow_between_species has to run once on each list of genomes and build a core genome as output for each run (three runs in total). Snakemake locates the input files and completes the first run, but then exits with a MissingOutputException stating that it cannot find the output files for the second and third runs.

The output should be written to the directory core_genome, which contains files and subdirectories. (input.dir contains all the genomes, input.liste is a file listing the genomes selected for analysis in each run, and the output directory is the folder core_genome.)

So far I have this (which, as expected, does not work):

configfile: "config_cand.yaml"
dirname = config["dirname"]

rule Gene_flow_between_species_all:
   input:
         expand("{dirname}/core_genome", dirname=dirname)

rule Gene_flow_between_species:
  input:
        dir = 'Campylobacter/genomes',
        liste = expand("{dirname}/path_to_genome_list.txt", dirname=dirname)
  output:
        directory(expand("{dirname}/core_genome", dirname=dirname))
  shell:
        'python pipelines/CoreCruncher/corecruncher_master.py -in {input.dir} -out {output} -list {input.liste} -freq 85 -prog usearch -ext .fa -length 80 -score 70 -align mafft'

Here is the log of the error I am receiving:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 56
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1   Gene_flow_between_species
    1   Gene_flow_between_species_all
    2

rule Gene_flow_between_species:
    input: Campylobacter/genomes, Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster2/path_to_genome_list.txt, Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster3/path_to_genome_list.txt, Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster4/path_to_genome_list.txt
    output: Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster2/core_genome, Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster3/core_genome, Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster4/core_genome
    jobid: 1

Waiting at most 5 seconds for missing files.
MissingOutputException in line 16 of /Users/adiop2/Bioinformatic_tool_awa/Snakefile:
Job completed successfully, but some output files are missing. Missing files after 5 seconds:
Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster3/core_genome
Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster4/core_genome
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

As for the config.yaml file:

dirname:
 - Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster2
 - Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster3
 - Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster4

Snakemake genome • 1.0k views

updated 9 months ago by Ram 45k • written 9 months ago by diop.awa94 • 0

score 0 · Answer 1 · 2024-07-13

As you've written it, the rule Gene_flow_between_species produces all of the output directories in one go (a single run) and takes the paths to all of the genome lists as input, rather than one at a time. Do you mean to have it run three separate times instead? If so, just drop those two expand() calls (keep expand() only in the "all" rule). That leaves the dirname wildcard unspecified, so Snakemake can decide what to fill in.

It's easier to troubleshoot shell-based rules if you include -p (--printshellcmds) so you can see exactly what command string is being executed. In this case I think you'll see multiple directories for -out and multiple files for -list, where I'm betting you expect just one of each in the arguments to your script. (I constantly find myself appending the combo -nrp when I'm running snakemake, to tell it to do a dry run, tell me why jobs are going to be run, and print all shell commands. From version 8 onward -r is automatic, FYI.)
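To make the difference concrete, here is a rough plain-Python sketch of the command string each variant would produce. It does not call Snakemake: expand_simple and render are hypothetical helpers written only for this illustration, and expand_simple is a simplified stand-in for expand() that handles a single wildcard. The one behaviour it relies on is that Snakemake joins multiple files with spaces when it fills {output} or {input.liste} into a shell command.

```python
# Hypothetical helpers for illustration only; this is not Snakemake's own code.

def expand_simple(pattern, **wildcards):
    """Simplified stand-in for expand(): fill ONE wildcard with each value."""
    name, values = next(iter(wildcards.items()))
    return [pattern.format(**{name: value}) for value in values]


def render(input_dir, outputs, listes):
    """Build the shell command, joining multiple paths with spaces."""
    return (
        "python pipelines/CoreCruncher/corecruncher_master.py "
        f"-in {input_dir} -out {' '.join(outputs)} -list {' '.join(listes)} "
        "-freq 85 -prog usearch -ext .fa -length 80 -score 70 -align mafft"
    )


# Values taken from the config file in the question.
dirname = [
    "Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster2",
    "Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster3",
    "Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster4",
]
outputs = expand_simple("{dirname}/core_genome", dirname=dirname)
listes = expand_simple("{dirname}/path_to_genome_list.txt", dirname=dirname)

# expand() inside the rule: ONE job whose command carries all three paths.
all_at_once = render("Campylobacter/genomes", outputs, listes)

# Plain wildcards in the rule: THREE jobs, each with a single path.
per_dir = [
    render("Campylobacter/genomes", [out], [liste])
    for out, liste in zip(outputs, listes)
]

print(all_at_once)
for command in per_dir:
    print(command)
```

If the script only honours the first value after -out, that would be consistent with the log in the question, where the cluster2 directory exists and the cluster3 and cluster4 directories are reported missing. That reading is an inference from the log, not something checked against the script.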

score 0 · Answer 2 · 2024-07-13

I think your rule should look like this, using normal wildcards:

rule Gene_flow_between_species:
    input:
        dir = 'Campylobacter/genomes',
        liste = "{dirname}/path_to_genome_list.txt"
    output:
        directory("{dirname}/core_genome")
    shell:
        """
        python pipelines/CoreCruncher/corecruncher_master.py -in {input.dir} -out {output} -list {input.liste} -freq 85 -prog usearch -ext .fa -length 80 -score 70 -align mafft
        """

Then a DAG with three parallel jobs, one per directory, will be created. How to deal with 'expand' felt a bit counterintuitive to me. As a rule of thumb, you often need only one expand() call, at the topmost position of the workflow (rule all); then let the solver deal with creating the jobs.

When the 'all' rule is encountered, the array values in dirname are expanded.

rule Gene_flow_between_species_all:
    input:
        expand("{dirname}/core_genome", dirname=dirname)

The solver looks for a rule that can generate:

directory_1/core_genome
directory_2/core_genome
.../core_genome

The pattern matching then finds that the output of rule Gene_flow_between_species, {dirname}/core_genome (equivalent to the regex (.+)/core_genome), matches each of these paths, and it creates one job instance of that rule per path by replacing the wildcard with the value required by the all rule.
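The matching step described above can be sketched in plain Python. This is only a toy model of the idea, not Snakemake's actual solver: pattern_to_regex and plan_jobs are hypothetical names made up for the example, and the only assumption is the (.+) regex equivalence mentioned above.

```python
import re

# Output and input patterns of the rule, as written in the answer above.
OUTPUT_PATTERN = "{dirname}/core_genome"
INPUT_PATTERN = "{dirname}/path_to_genome_list.txt"


def pattern_to_regex(pattern):
    """Turn '{name}/core_genome' into a regex with one named group per wildcard."""
    parts = re.split(r"\{(\w+)\}", pattern)
    # Even indices hold literal text, odd indices hold wildcard names.
    regex = "".join(
        re.escape(part) if index % 2 == 0 else f"(?P<{part}>.+)"
        for index, part in enumerate(parts)
    )
    return re.compile(f"^{regex}$")


def plan_jobs(targets):
    """Toy solver: one job per requested target that matches the output pattern."""
    matcher = pattern_to_regex(OUTPUT_PATTERN)
    jobs = []
    for target in targets:
        match = matcher.match(target)
        if match is None:
            continue  # this rule cannot produce the target
        wildcards = match.groupdict()
        jobs.append({
            "output": target,
            "wildcards": wildcards,
            # The same wildcard value is substituted into the input pattern.
            "liste": INPUT_PATTERN.format(**wildcards),
        })
    return jobs


# Targets that the "all" rule requests after expand(), using the config values.
targets = [
    "Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster2/core_genome",
    "Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster3/core_genome",
    "Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster4/core_genome",
]
jobs = plan_jobs(targets)
for job in jobs:
    print(job["wildcards"]["dirname"], "->", job["liste"])
```

Each requested path yields its own job with its own dirname value, which is why the rule itself needs no expand().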