Missing Output Exception Error in Snakemake with output directory
diop.awa94 • 4 months ago

I'm trying to run a workflow in Snakemake. I need to automate a couple of steps that all depend on existing Python scripts or pipelines. My rule Gene_flow_between_species has to run once per list of genomes and build a core genome as output for each run (three runs in total). Snakemake locates the input files and completes the first steps of the first run in the workflow, but then exits with a MissingOutputException stating it cannot find the output files for the second and third runs.

The output should be written to the directory core_genome and contains some files and subdirectories. (input.dir contains all the genomes, input.liste is a file in which I listed the selected genomes to analyze for each run, and the output directory is the folder core_genome.)

So far I have this (which, as you might expect, is not working):

configfile: "config_cand.yaml"
dirname = config["dirname"]

rule Gene_flow_between_species_all:
    input:
        expand("{dirname}/core_genome", dirname=dirname)

rule Gene_flow_between_species:
    input:
        dir = 'Campylobacter/genomes',
        liste = expand("{dirname}/path_to_genome_list.txt", dirname=dirname)
    output:
        directory(expand("{dirname}/core_genome", dirname=dirname))
    shell:
        'python pipelines/CoreCruncher/corecruncher_master.py -in {input.dir} -out {output} -list {input.liste} -freq 85 -prog usearch -ext .fa -length 80 -score 70 -align mafft'

Here is the log of the error I am receiving:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 56
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1   Gene_flow_between_species
    1   Gene_flow_between_species_all
    2

rule Gene_flow_between_species:
    input: Campylobacter/genomes, Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster2/path_to_genome_list.txt, Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster3/path_to_genome_list.txt, Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster4/path_to_genome_list.txt
    output: Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster2/core_genome, Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster3/core_genome, Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster4/core_genome
    jobid: 1

Waiting at most 5 seconds for missing files.
MissingOutputException in line 16 of /Users/adiop2/Bioinformatic_tool_awa/Snakefile:
Job completed successfully, but some output files are missing. Missing files after 5 seconds:
Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster3/core_genome
Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster4/core_genome
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

As for the config_cand.yaml file:

dirname:
 - Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster2
 - Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster3
 - Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster4
Jesse • 4 months ago

As you've written it, the rule Gene_flow_between_species produces all of the output directories in one go (a single run), and takes the paths to all of the genome lists as input rather than just one at a time. Do you mean to have it run separately three times instead? If so, just drop those two expand() calls (keep expand() only in the "all" rule), which will leave the dirname wildcard unspecified so Snakemake can decide what to fill in.

It's easier to troubleshoot shell-based rules if you include -p (--printshellcmds) so you can see exactly what command string is getting executed. In this case I think you'll see multiple directories for -out and multiple files for -list where I'm betting you expect just one in the arguments to your script. (I constantly find myself appending the combo -nrp when running snakemake to tell it to do a dry run, tell me why jobs are going to be run, and print all shell commands. From version 8 onward -r is automatic, FYI.)
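For example, from the workflow directory (the core count here just mirrors the log above; adjust to your machine):

    # Dry run: report why each job would run (-r) and print its shell command (-p)
    snakemake -nrp --cores 56

    # Then run for real, still printing the shell commands
    snakemake -p --cores 56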


Oh wow it worked! Thank you so much, that makes a lot more sense! Thanks Jesse!

Michael • 4 months ago

I think your rule should look like this, using normal wildcards:

rule Gene_flow_between_species:
    input:
        dir = 'Campylobacter/genomes',
        liste = "{dirname}/path_to_genome_list.txt"
    output:
        directory("{dirname}/core_genome")
    shell:
        """
        python pipelines/CoreCruncher/corecruncher_master.py -in {input.dir} -out {output} -list {input.liste} -freq 85 -prog usearch -ext .fa -length 80 -score 70 -align mafft
        """

Then a DAG with 3 parallel jobs, one per directory, will be created. How to deal with 'expand' felt a bit counterintuitive to me at first. As a rule of thumb, you often need a single expand() only at the topmost position of the workflow (the "all" rule), then let the solver deal with creating the jobs.

When the 'all' rule is encountered, the array values in dirname are expanded.

rule Gene_flow_between_species_all:
    input:
        expand("{dirname}/core_genome", dirname=dirname)

The solver looks for a rule that can generate:

  • directory_1/core_genome
  • directory_2/core_genome
  • .../core_genome

The pattern matching then finds that the output of rule Gene_flow_between_species matches "{dirname}/core_genome" (which is equivalent to the regex (.+)/core_genome) and creates multiple job instances of that rule by replacing the wildcard with the values required by the all rule.
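A rough sketch of that matching in plain Python (just an illustration of the idea, not Snakemake's actual implementation):

    import re

    # The output pattern "{dirname}/core_genome" behaves like this regex:
    # each wildcard becomes a named group matching any non-empty string (.+).
    pattern = re.compile(r"(?P<dirname>.+)/core_genome")

    targets = [
        "Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster2/core_genome",
        "Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster3/core_genome",
        "Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster4/core_genome",
    ]

    for target in targets:
        m = pattern.fullmatch(target)
        # Each match becomes one job, with wildcards.dirname bound to the capture
        print(m.group("dirname"))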


Excellent, this is exactly what I needed. Thank you so much, that makes a lot more sense! Thanks Michael for the help again!

