I'm trying to create wildcards from the names of the folders/directories that rule ReferenceDatabase creates under Campylobacter/Gene_Flow/ReferenceDatabase/{dirname}
(cluster1, cluster2, ... correspond to the dirname wildcard), but I have no way of knowing how many "cluster" directories will be created before this rule runs for the first time. So I tried to write the Snakefile as below:
import glob

# Need sample name and dirname
SAMPLES, = glob_wildcards("Campylobacter/core_genome/core/{sample}.fa.align")
dirnames, = glob_wildcards("Campylobacter/Gene_Flow/ReferenceDatabase/{dirname}", "Campylobacter/Gene_Flow/DatabaseQuery/{dirname}/{dirname}")

wildcard_constraints:
    dirname="cluster[0-9]+"

rule all:
    input:
        distmat_out = "Campylobacter/ANI_results/ani/ani.distmat",
        parse_distances_out = "Campylobacter/ANI_results/genome_pairs.csv",
        cluster_genomes_out = "Campylobacter/ANI_results/cluster_genomes.csv",
        liste_genomes = expand("Campylobacter/Gene_Flow/ReferenceDatabase/{dirname}/path_to_genome_list.txt", dirname=dirnames),
        core_genome_within_species = expand("Campylobacter/Gene_Flow/ReferenceDatabase/{dirname}/core_genome/concat.fa", dirname=dirnames),
        distances_between_genomes_r = expand("Campylobacter/Gene_Flow/ReferenceDatabase/{dirname}/core_genome/distances.dist", dirname=dirnames)

rule define_ANI_species:
    input:
        fasta = "Campylobacter/core_genome/concat.fa",
        dir = "Campylobacter"
    output:
        distmat = "Campylobacter/ANI_results/ani/ani.distmat",
        parse_distances = "Campylobacter/ANI_results/genome_pairs.csv",
        cluster_genomes = "Campylobacter/ANI_results/cluster_genomes.csv",
    shell:
        """
        mkdir -p Campylobacter/ANI_results/ani
        distmat -sequence {input.fasta} -nucmethod 0 -outfile {output.distmat}
        python pipelines/ANI/parse_distances.py {input.dir}
        python pipelines/ANI/cluster_genomes.py {input.dir}
        """

rule ReferenceDatabase:
    input:
        cluster_genomes = "Campylobacter/ANI_results/cluster_genomes.csv",
        dir = "Campylobacter"
    output:
        liste = "Campylobacter/Gene_Flow/ReferenceDatabase/{dirname}/path_to_genome_list.txt"
    shell:
        "python pipelines/ConSpecifix/create_Refdb.py {input.dir}"

rule core_genome_within_species:
    input:
        dir = "Campylobacter/genomes",
        liste = "Campylobacter/Gene_Flow/ReferenceDatabase/{dirname}/path_to_genome_list.txt"
    output:
        fasta = "Campylobacter/Gene_Flow/ReferenceDatabase/{dirname}/core_genome/concat.fa",
        family = "Campylobacter/Gene_Flow/ReferenceDatabase/{dirname}/core_genome/families_core.txt"
    params:
        dir = directory("Campylobacter/Gene_Flow/ReferenceDatabase/{dirname}/core_genome")
    shell:
        "python pipelines/CoreCruncher/corecruncher_master.py -in {input.dir} -out {params.dir} -list {input.liste} -freq 85 -prog usearch -ext .fa -length 80 -score 70 -align mafft"

I got this error:
rule ReferenceDatabase:
    input: Campylobacter/ANI_results/genome_clusters.csv, Campylobacter
    output: Campylobacter/Gene_Flow/ReferenceDatabase/cluster[0-9]+/path_to_genome_list.txt
    jobid: 18
    wildcards: dirname=cluster[0-9]+
Waiting at most 5 seconds for missing files.
MissingOutputException in line 171 of /Users/home//Bioinformatic_tool/Snakefile:
Job completed successfully, but some output files are missing. Missing files after 5 seconds:
Campylobacter/Gene_Flow/ReferenceDatabase/cluster[0-9]+/path_to_genome_list.txt
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
It seems that Snakemake does not recognize the regex [0-9]+ that I used.
Is there a wildcard for an integer that I can use to match cluster1, cluster2, cluster3, ... (directory1, directory2, directory3, ...)?
Thanks, tim, for your answer. I did a dry-run and here is the result:
It looks like it treats cluster[0-9]+ as the dirname, and not cluster1, cluster2, cluster3, ...
When I ran the workflow, the output directories (cluster1, cluster2, cluster3, ...) were created, but a directory named cluster[0-9]+ was created as well, and I don't know why.
I wanted to add that when I did a dry-run with the workflow above I got:
but when I change the top of my workflow like this:
I got the first dry-run result posted above, copied again here:
Right, so if you simply add the line
dirname="cluster[0-9]+"
with no wildcard_constraints: header, then that sets the global variable dirname to this string. This string is then used in the expand() expression in your all rule, so you are instructing Snakemake to make a file called:
Campylobacter/Gene_Flow/ReferenceDatabase/cluster[0-9]+/path_to_genome_list.txt
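To make the literal substitution concrete, here is a minimal sketch in plain Python. The helper expand_like is a hypothetical stand-in that only mimics expand() for a single wildcard using str.format; it is not Snakemake's own implementation. It shows that a plain string value is dropped into the path as-is, with no regex interpretation:

```python
# Hypothetical stand-in for expand(), limited to one wildcard named "dirname".
# It substitutes each value literally, which is the behaviour described above.
def expand_like(pattern, dirname):
    # A bare string counts as a single value, not as a pattern to be matched.
    values = [dirname] if isinstance(dirname, str) else list(dirname)
    return [pattern.format(dirname=value) for value in values]


PATTERN = "Campylobacter/Gene_Flow/ReferenceDatabase/{dirname}/path_to_genome_list.txt"

# A global variable holding the regex text, as in the modified workflow.
dirname = "cluster[0-9]+"

targets = expand_like(PATTERN, dirname)
print(targets)
```

The single path that is printed contains cluster[0-9]+ verbatim.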
You may look at that and see a regex pattern, but Snakemake is not like the Bash shell, where glob patterns are implicitly matched. If you want to do a regex match you have to be explicit. Snakemake just sees this as a filename. And whenever Snakemake runs a job, it first creates the directories for all the output files before it runs your shell: code.

So when you say "I don't know why it created cluster[0-9]+", this is the reason. But as I said before, there is no simple fix that will make Snakemake do what you want for your workflow. If you have a rule that makes an indeterminate number of outputs (which is what you seem to have here), you need to structure your workflow as I suggested in the previous answer.
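As an illustration of what an explicit match can look like, here is a rough sketch in plain Python. It lists the cluster directories only once they exist on disk and filters them with re.fullmatch. The function names find_cluster_dirs and genome_list_targets are hypothetical, and how such a helper would be wired into the workflow is an assumption on my part: in Snakemake it would typically be called from an input function that depends on a checkpoint, so that it runs only after the directories have been created. That wiring is not shown here.

```python
import re
import tempfile
from pathlib import Path

# Explicit regex for the directory names; fullmatch() rejects anything else.
CLUSTER_RE = re.compile(r"cluster[0-9]+")


def find_cluster_dirs(refdb_root):
    """Return the names of existing sub-directories that look like cluster<N>."""
    root = Path(refdb_root)
    if not root.is_dir():
        return []  # nothing has been created yet
    names = [
        p.name for p in root.iterdir()
        if p.is_dir() and CLUSTER_RE.fullmatch(p.name)
    ]
    # Sort numerically so that cluster10 comes after cluster2.
    return sorted(names, key=lambda name: int(name[len("cluster"):]))


def genome_list_targets(refdb_root):
    """Build one path_to_genome_list.txt path per discovered cluster directory."""
    return [
        f"{refdb_root}/{name}/path_to_genome_list.txt"
        for name in find_cluster_dirs(refdb_root)
    ]


# Small demonstration in a throwaway directory (the names are made up).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("cluster1", "cluster2", "cluster10", "cluster[0-9]+", "notes"):
        (Path(tmp) / name).mkdir()
    demo_dirs = find_cluster_dirs(tmp)
    print(demo_dirs)
```

Note that the stray directory literally named cluster[0-9]+ is skipped, because its name does not fully match the pattern.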