I'm trying to run a workflow with Snakemake. I need to automate several steps, all of which depend on existing Python scripts or pipelines. My rule Gene_flow_between_species has to run once for each list of genomes and build a core genome as output for each run (three runs in total). Snakemake seems able to locate the input files and starts working through the first run, but then exits with a MissingOutputException stating that it cannot find the output files for the second and third runs.
The output should be written to the directory core_genome, which contains some files and subdirectories. (input.dir contains all the genomes, input.liste is a file listing the genomes selected for each run, and the output directory is the folder core_genome.)
So far, I have this (which is not working, as you might expect):
configfile: "config_cand.yaml"

dirname = config["dirname"]

rule Gene_flow_between_species_all:
    input:
        expand("{dirname}/core_genome", dirname=dirname)

rule Gene_flow_between_species:
    input:
        dir = 'Campylobacter/genomes',
        liste = expand("{dirname}/path_to_genome_list.txt", dirname=dirname)
    output:
        directory(expand("{dirname}/core_genome", dirname=dirname))
    shell:
        'python pipelines/CoreCruncher/corecruncher_master.py -in {input.dir} -out {output} -list {input.liste} -freq 85 -prog usearch -ext .fa -length 80 -score 70 -align mafft'
Here is the log of the error I am receiving:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 56
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1       Gene_flow_between_species
    1       Gene_flow_between_species_all
    2

rule Gene_flow_between_species:
    input: Campylobacter/genomes, Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster2/path_to_genome_list.txt, Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster3/path_to_genome_list.txt, Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster4/path_to_genome_list.txt
    output: Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster2/core_genome, Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster3/core_genome, Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster4/core_genome
    jobid: 1
Waiting at most 5 seconds for missing files.
MissingOutputException in line 16 of /Users/adiop2/Bioinformatic_tool_awa/Snakefile:
Job completed successfully, but some output files are missing. Missing files after 5 seconds:
Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster3/core_genome
Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster4/core_genome
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
As for the config.yaml file:
dirname:
- Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster2
- Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster3
- Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster4
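One possible reading of the log, sketched in plain Python: because both `liste` and `output` in Gene_flow_between_species are built with expand, the rule declares all three directories at once and runs as a single job (jobid 1), so the shell command is issued only once with every path substituted into it. The helper `expand_like` and the shortened `TEMPLATE` below are hypothetical stand-ins written for illustration; they are not Snakemake's implementation, and how CoreCruncher treats several paths after -out is an assumption.

```python
# Hypothetical stand-in for Snakemake's expand() with a single {dirname} wildcard.
def expand_like(pattern, dirname):
    return [pattern.format(dirname=d) for d in dirname]

# Values copied from the config file in the question.
dirnames = [
    "Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster2",
    "Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster3",
    "Campylobacter/Gene_Flow/DatabaseQuery/cluster1/cluster4",
]

# Shortened command template (the remaining CoreCruncher flags are left out).
TEMPLATE = (
    "python pipelines/CoreCruncher/corecruncher_master.py "
    "-in {indir} -out {out} -list {liste}"
)
GENOMES = "Campylobacter/genomes"

def command_with_expand(dirnames):
    """One job: all expanded paths end up in a single command line."""
    outputs = expand_like("{dirname}/core_genome", dirnames)
    lists = expand_like("{dirname}/path_to_genome_list.txt", dirnames)
    # A list-valued placeholder is rendered as space-separated paths.
    return TEMPLATE.format(indir=GENOMES, out=" ".join(outputs), liste=" ".join(lists))

def commands_per_directory(dirnames):
    """One job per directory, as a {dirname} wildcard in the rule would give."""
    return [
        TEMPLATE.format(
            indir=GENOMES,
            out=f"{d}/core_genome",
            liste=f"{d}/path_to_genome_list.txt",
        )
        for d in dirnames
    ]

if __name__ == "__main__":
    print("With expand in the rule (one command):")
    print(command_with_expand(dirnames))
    print("With one job per directory (three commands):")
    for cmd in commands_per_directory(dirnames):
        print(cmd)
```

In Snakefile terms, the second variant would correspond to writing `liste` and `output` in Gene_flow_between_species with a plain `{dirname}` wildcard (no expand), and keeping expand only in the Gene_flow_between_species_all target rule, so that Snakemake schedules one job per directory. This is untested against CoreCruncher itself.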
Oh wow it worked! Thank you so much, that makes a lot more sense! Thanks Jesse!