Question

Wildcards and regular expressions in snakemake

0

Entering edit mode

3 months ago

diop.awa94 • 0

I'm trying to create wildcards with some folders/directories names that are output from rule ReferenceDatabase that created the Campylobacter/Gene_Flow/ReferenceDatabase/{dirname} folder ( cluster1, cluster2, ... correspond to dirname wildcards) but I'm not able to know how many "cluster" directories will be created at the first time this rule is running. So I tried to write the Snakefile as below:

import glob

# Need sample name and dirname
SAMPLES, = glob_wildcards("Campylobacter/core_genome/core/{sample}.fa.align")

dirnames, = glob_wildcards("Campylobacter/Gene_Flow/ReferenceDatabase/{dirname}", "Campylobacter/Gene_Flow/DatabaseQuery/{dirname}/{dirname}")

wildcard_constraints:
    dirname="cluster[0-9]+"

rule all:
    input:
        distmat_out = "Campylobacter/ANI_results/ani/ani.distmat",
        parse_distances_out = "Campylobacter/ANI_results/genome_pairs.csv",
        cluster_genomes_out = "Campylobacter/ANI_results/cluster_genomes.csv",
        liste_genomes = expand("Campylobacter/Gene_Flow/ReferenceDatabase/{dirname}/path_to_genome_list.txt", dirname=dirnames),
        core_genome_within_species = expand("Campylobacter/Gene_Flow/ReferenceDatabase/{dirname}/core_genome/concat.fa", dirname=dirnames),
        distances_between_genomes_r = expand("Campylobacter/Gene_Flow/ReferenceDatabase/{dirname}/core_genome/distances.dist", dirname=dirnames)

rule define_ANI_species:
    input:
        fasta = "Campylobacter/core_genome/concat.fa",
        dir = "Campylobacter"
    output:
        distmat = "Campylobacter/ANI_results/ani/ani.distmat",
        parse_distances = "Campylobacter/ANI_results/genome_pairs.csv",
        cluster_genomes = "Campylobacter/ANI_results/cluster_genomes.csv",
    shell:
        """
        mkdir -p Campylobacter/ANI_results/ani
        distmat  -sequence {input.fasta} -nucmethod 0 -outfile {output.distmat}
        python pipelines/ANI/parse_distances.py {input.dir}
        python pipelines/ANI/cluster_genomes.py {input.dir}
        """
rule ReferenceDatabase:
    input:
        cluster_genomes = "Campylobacter/ANI_results/cluster_genomes.csv",
        dir = "Campylobacter"
    output:
        liste = "Campylobacter/Gene_Flow/ReferenceDatabase/{dirname}/path_to_genome_list.txt"
    shell:
        "python pipelines/ConSpecifix/create_Refdb.py {input.dir}"

rule core_genome_within_species:
    input:
        dir = "Campylobacter/genomes",
        liste = "Campylobacter/Gene_Flow/ReferenceDatabase/{dirname}/path_to_genome_list.txt"
    output:
        fasta = "Campylobacter/Gene_Flow/ReferenceDatabase/{dirname}/core_genome/concat.fa",
        family = "Campylobacter/Gene_Flow/ReferenceDatabase/{dirname}/core_genome/families_core.txt"
    params:
        dir = directory("Campylobacter/Gene_Flow/ReferenceDatabase/{dirname}/core_genome")
    shell:
        "python pipelines/CoreCruncher/corecruncher_master.py -in {input.dir} -out {params.dir} -list {input.liste} -freq 85 -prog usearch -ext .fa -length 80 -score 70 -align mafft"

I got this error:

rule ReferenceDatabase:
    input: Campylobacter/ANI_results/genome_clusters.csv, Campylobacter
    output: Campylobacter/Gene_Flow/ReferenceDatabase/cluster[0-9]+/path_to_genome_list.txt
    jobid: 18
    wildcards: dirname=cluster[0-9]+
    Waiting at most 5 seconds for missing files.
    MissingOutputException in line 171 of /Users/home//Bioinformatic_tool/Snakefile:
    Job completed successfully, but some output files are missing. Missing files after 5 seconds:
    Campylobacter/Gene_Flow/ReferenceDatabase/cluster[0-9]+/path_to_genome_list.txt
    This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.

It seems that snakemake does not recognize the regex used [0-9]+

Is there like a wildcard for an int that I can use to match: cluster1, cluster2 , cluster3 ...? (directory1, directory2, directory3 ...?)

snakemake • 488 views

ADD COMMENT • link updated 3 months ago by tim.booth ▴ 110 • written 3 months ago by diop.awa94 • 0

score 1 · Answer 1 · 2024-08-09

1

Entering edit mode

3 months ago

tim.booth ▴ 110

Your regex looks fine to me, but...

The log indicates that you are trying to make an output file literally named cluster[0-9]+/path_to_genome_list.txt which is clearly not right. I'm not sure where that came from. Maybe something on your Snakemake command line, which we don't see? Or maybe you accidentally created a directory with this name already and the glob() is picking it up?

It's not possible to have an indeterminate number of outputs from a Snakemake rule, because Snakemake always resolves the output of a job first, then the inputs. Then it builds up a DAG. Then (assuming you didn't do a --dry-run) it runs the shell commands. So nothing that happens in the shell commands can affect the names of the output files. There used to be a thing called "dynamic" outputs that tried to make this happen, but they were very problematic and got removed.

The solution is pretty simple. If you have a clustering step that produces an indeterminate number of outputs then you have to break your workflow in two. The first part of the workflow runs the "ReferenceDatabase" rule, then the second part can use glob() or glob_wildcards() to discover how many outputs were made and build a DAG accordingly.

There's a feature in Snakemake called "checkpoints" which is the replacement for "dynamic outputs". Checkpoints work well, but basically all they do is to allow you to run both parts of your workflow in one go, without invoking Snakemake twice. There's no point trying to use checkpoints to combine your workflow until that workflow is tested and working in two separate stages.

Hope that makes sense.

ADD COMMENT • link 3 months ago by tim.booth ▴ 110

0

Entering edit mode

Thanks tim for your answer I did a dry-run and here is the result:

Building DAG of jobs...
Job counts:
count   jobs
1   ReferenceDatabase
1   all
1   define_ANI_species
1   core_genome_within_species
4

[Fri Aug  9 08:55:23 2024]
rule define_ANI_species:
input: Campylobacter, Campylobacter/core_genome/concat.fa
output: Campylobacter/ANI_results/cluster_genomes.csv, Campylobacter/ANI_results/genome_pairs.csv, Campylobacter/ANI_results/ani/ani.distmat
jobid: 1
reason: Missing output files: Campylobacter/ANI_results/ani/ani.distmat, Campylobacter/ANI_results/genome_pairs.csv, Campylobacter/ANI_results/cluster_genomes.csv; Input files updated by another job: Campylobacter/core_genome/concat.fa


    mkdir Campylobacter/ANI_results
    mkdir Campylobacter/ANI_results/ani
    distmat  -sequence Campylobacter/core_genome/concat.fa -nucmethod 0 -outfile Campylobacter/ANI_results/ani/ani.distmat
    python pipelines/ANI/parse_distances.py Campylobacter
    python pipelines/ANI/cluster_genomes.py Campylobacter


[Fri Aug  9 08:55:23 2024]
rule ReferenceDatabase:
input: Campylobacter/ANI_results/genome_clusters.csv, Campylobacter
output: Campylobacter/Gene_Flow/ReferenceDatabase/cluster[0-9]+/path_to_genome_list.txt
jobid: 19
reason: Missing output files: Campylobacter/Gene_Flow/ReferenceDatabase/cluster[0-9]+/path_to_genome_list.txt
wildcards: dirname=cluster[0-9]+

python pipelines/ConSpecifix/create_Refdb.py Campylobacter

[Fri Aug  9 08:55:23 2024]
rule core_genome_within_species:
input: Campylobacter/Gene_Flow/ReferenceDatabase/cluster[0-9]+/path_to_genome_list.txt, Campylobacter/genomes
output: Campylobacter/Gene_Flow/ReferenceDatabase/cluster[0-9]+/core_genome/families_core.txt, Campylobacter/Gene_Flow/ReferenceDatabase/cluster[0-9]+/core_genome/concat.fa
jobid: 3
reason: Missing output files: Campylobacter/Gene_Flow/ReferenceDatabase/cluster[0-9]+/core_genome/concat.fa; Input files updated by another job: Campylobacter/Gene_Flow/ReferenceDatabase/cluster[0-9]+/path_to_genome_list.txt
wildcards: dirname=cluster[0-9]+

python pipelines/CoreCruncher/corecruncher_master.py -in Campylobacter/genomes -out       Campylobacter/Gene_Flow/ReferenceDatabase/cluster[0-9]+/core_genome -list Campylobacter/Gene_Flow/ReferenceDatabase/cluster[0-9]+/path_to_genome_list.txt -freq 85 -prog usearch -ext .fa -length 80 -score 70 -align mafft

It looks like it consider cluster[0-9]+ as dirname and not cluster1, cluster2 cluster3 ....

Because when I run the workflow the ouputs directories (cluster1, cluster2, cluster3 ....) were created and also cluster[0-9]+ was created. And I don't know why it created cluster[0-9]+

ADD REPLY • link 3 months ago by diop.awa94 • 0

0

Entering edit mode

I wanted to add that: when I did a dry-run with the workflow above I got:

Building DAG of jobs... Nothing to be done.

but when I change the top of my workflow as like this:

dirname, = glob_wildcards("Campylobacter/Gene_Flow/ReferenceDatabase/{dirname}", "Campylobacter/Gene_Flow/DatabaseQuery/{dirname}/{dirname}")
dirname="cluster[0-9]+"

I got the first result of dry-run posted above and copied again here

Building DAG of jobs...
Job counts:
count   jobs
1   ReferenceDatabase
1   all
1   define_ANI_species
1   core_genome_within_species
4

[Fri Aug  9 08:55:23 2024]
rule define_ANI_species:
input: Campylobacter, Campylobacter/core_genome/concat.fa
output: Campylobacter/ANI_results/cluster_genomes.csv, Campylobacter/ANI_results/genome_pairs.csv, Campylobacter/ANI_results/ani/ani.distmat
jobid: 1
reason: Missing output files: Campylobacter/ANI_results/ani/ani.distmat, Campylobacter/ANI_results/genome_pairs.csv, Campylobacter/ANI_results/cluster_genomes.csv; Input files updated by another job: Campylobacter/core_genome/concat.fa


    mkdir Campylobacter/ANI_results
    mkdir Campylobacter/ANI_results/ani
    distmat  -sequence Campylobacter/core_genome/concat.fa -nucmethod 0 -outfile Campylobacter/ANI_results/ani/ani.distmat
    python pipelines/ANI/parse_distances.py Campylobacter
    python pipelines/ANI/cluster_genomes.py Campylobacter


[Fri Aug  9 08:55:23 2024]
rule ReferenceDatabase:
input: Campylobacter/ANI_results/genome_clusters.csv, Campylobacter
output: Campylobacter/Gene_Flow/ReferenceDatabase/cluster[0-9]+/path_to_genome_list.txt
jobid: 19
reason: Missing output files: Campylobacter/Gene_Flow/ReferenceDatabase/cluster[0-9]+/path_to_genome_list.txt
wildcards: dirname=cluster[0-9]+

python pipelines/ConSpecifix/create_Refdb.py Campylobacter

[Fri Aug  9 08:55:23 2024]
rule core_genome_within_species:
input: Campylobacter/Gene_Flow/ReferenceDatabase/cluster[0-9]+/path_to_genome_list.txt, Campylobacter/genomes
output: Campylobacter/Gene_Flow/ReferenceDatabase/cluster[0-9]+/core_genome/families_core.txt, Campylobacter/Gene_Flow/ReferenceDatabase/cluster[0-9]+/core_genome/concat.fa
jobid: 3
reason: Missing output files: Campylobacter/Gene_Flow/ReferenceDatabase/cluster[0-9]+/core_genome/concat.fa; Input files updated by another job: Campylobacter/Gene_Flow/ReferenceDatabase/cluster[0-9]+/path_to_genome_list.txt
wildcards: dirname=cluster[0-9]+

python pipelines/CoreCruncher/corecruncher_master.py -in Campylobacter/genomes -out Campylobacter/Gene_Flow/ReferenceDatabase/cluster[0-9]+/core_genome -list Campylobacter/Gene_Flow/ReferenceDatabase/cluster[0-9]+/path_to_genome_list.txt -freq 85 -prog usearch -ext .fa -length 80 -score 70 -align mafft

ADD REPLY • link 3 months ago by diop.awa94 • 0

1

Entering edit mode

Right, so if you if you simply add the line dirname="cluster[0-9]+" with no wildcard_contraints: header then that sets the global variable dirname to this string, and then this string is used in the expand() expression in your all rule, so you are instructing Snakemake to make a file called:

Campylobacter/Gene_Flow/ReferenceDatabase/cluster[0-9]+/path_to_genome_list.txt

You may look at that and see a regex pattern, but Snakemake is not like the Bash shell where glob patterns are implicitly matched. If you want to do a regex match you have to be explicit. Snakemake just sees this as a filename. And whenever Snakemake runs a job it first creates directories for all the output files, before it runs your shell: code.

So when you say "I don't know why it created cluster[0-9]+", this is the reason. But as I said before, there is no simple fix that will make Snakemake do what you want it to do for your workflow. If you have a rule that makes an indeterminate number of outputs (which is what you seem to have here) you need to structure your workflow as I suggested in the previous answer.

ADD REPLY • link 3 months ago by tim.booth ▴ 110