I'm trying to use Snakemake to build an automated pipeline. The original input file is a list of sample names; the workflow pulls the corresponding data from a server and then processes it for WGS assembly with several different methods (flye, unicycler, trycycler, etc.).
It worked fine with my test data, but on real data I'm running into issues: the "trycycler subsample" step fails on samples that don't have sufficient read depth. Here is a mock-up of the code.
# (output_dir and sample_list are defined earlier; omitted from this mock-up)

rule all:
    input:
        expand(f'{output_dir}/{{sample}}.{{assembler}}', sample=sample_list, assembler=['flye', 'unicycler', 'trycycler'])

rule download_files:
    input:
        'sample_list.txt'
    output:
        temp('download_complete.temp')
    shell:
        """
        rsync blah blah blah blah
        touch download_complete.temp
        """

rule flye:
    input:
        'download_complete.temp'
    output:
        f'{output_dir}/{{sample}}.flye'
    shell:
        """
        flye blah blah blah
        """

rule unicycler:
    input:
        'download_complete.temp'
    output:
        f'{output_dir}/{{sample}}.unicycler'
    shell:
        """
        unicycler blah blah blah
        """

rule trycycler_subsample:
    input:
        'download_complete.temp'
    output:
        f'{output_dir}/{{sample}}.subsamples'
    shell:
        """
        trycycler subsample blah blah blah
        """

rule trycycler_cluster:
    input:
        f'{output_dir}/{{sample}}.subsamples'
    output:
        f'{output_dir}/{{sample}}.trycycler'
    shell:
        """
        trycycler cluster blah blah blah
        """
Ignoring the specifics of the actual shell commands here, this will compile to a DAG that looks something like this (please pardon my text art):
            sample_list.txt
                  |
            download_files
           /      |       \
       flye   unicycler   trycycler_subsample
        |         |              |
        |         |       trycycler_cluster
        |         |              |
         \        |             /
          \       |            /
                 all
Now this is pretty oversimplified; there are a few preprocessing steps, etc. The rightmost chain in the DAG also has steps between trycycler_subsample and trycycler_cluster that perform mini-assemblies on the subsamples, before using trycycler_cluster to make a final assembly. For the purposes of this question, that's irrelevant.
Essentially my question is this: if trycycler_subsample fails due to low read depth, and I have no way of knowing the read depth before the file is downloaded, how can I safely abort the rest of the rightmost chain without affecting/ending the entire pipeline? Snakemake still expects to see the folder f'{output_dir}/{{sample}}.trycycler', but there is no point proceeding through several extra steps in that chain if the very first subsample command exits in failure. I've seen some suggestions of using Python to conditionally modify the targets of rule all, but again, that would have to happen before the workflow is ever executed, and I have no way of assessing depth until after the files are downloaded.
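For reference, the suggestions I've seen look roughly like the following, where depth_lookup and MIN_DEPTH are hypothetical placeholders for information I simply don't have before download_files has run:

# Hypothetical: filter the sample list up front using a known depth table.
# This only works if read depths are already available, which in my case
# they are not until after the data has been downloaded.
deep_enough = [s for s in sample_list if depth_lookup[s] >= MIN_DEPTH]

rule all:
    input:
        expand(f'{output_dir}/{{sample}}.{{assembler}}', sample=sample_list, assembler=['flye', 'unicycler']),
        expand(f'{output_dir}/{{sample}}.trycycler', sample=deep_enough)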
What's the best recourse here?
If you're just starting out with Snakemake, it might be worth switching to Nextflow, which is designed for this kind of problem. Otherwise, I think you have to look at checkpoints.
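Here is a rough sketch of what a checkpoint-based approach could look like; the assess_depth step, the depth-file format, and MIN_DEPTH are all hypothetical placeholders to adapt to whatever depth estimate you actually use. The idea is that Snakemake re-evaluates the DAG after a checkpoint completes, so rule all can ask for the trycycler target only for samples that turn out to have enough depth:

# Sketch only: assess_depth, its output format, and MIN_DEPTH are assumptions.
checkpoint assess_depth:
    input:
        'download_complete.temp'
    output:
        f'{output_dir}/{{sample}}.depth'
    shell:
        """
        some_depth_estimate blah blah blah > {output}
        """

def trycycler_targets(wildcards):
    # Evaluated only after the assess_depth jobs have finished;
    # checkpoints.<name>.get() enforces that dependency.
    targets = []
    for s in sample_list:
        depth_file = checkpoints.assess_depth.get(sample=s).output[0]
        with open(depth_file) as fh:
            depth = float(fh.read().strip())
        if depth >= MIN_DEPTH:
            targets.append(f'{output_dir}/{s}.trycycler')
    return targets

rule all:
    input:
        expand(f'{output_dir}/{{sample}}.{{assembler}}', sample=sample_list, assembler=['flye', 'unicycler']),
        trycycler_targets

With something like this, samples that fall below the threshold simply never get a trycycler_subsample job scheduled, while flye and unicycler still run for every sample and the rest of the pipeline is unaffected.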