I'm trying to use Snakemake to build an automated pipeline. The original input file is a list of sample names; the workflow pulls the corresponding data from a server and then processes it for WGS assembly with several different methods (flye, unicycler, trycycler, etc.).
It worked fine with my test data, but on real data I'm running into issues: the "trycycler subsample" step fails on samples that don't have sufficient read depth. Here is a mock-up of the code.
# (output_dir and sample_list are defined earlier; omitted from this mock-up)

rule all:
    input:
        expand(f'{output_dir}/{{sample}}.{{assembler}}', sample=sample_list, assembler=['flye', 'unicycler', 'trycycler'])

rule download_files:
    input:
        'sample_list.txt'
    output:
        temp('download_complete.temp')
    shell:
        """
        rsync blah blah blah blah
        touch download_complete.temp
        """

rule flye:
    input:
        'download_complete.temp'
    output:
        f'{output_dir}/{{sample}}.flye'
    shell:
        """
        flye blah blah blah
        """

rule unicycler:
    input:
        'download_complete.temp'
    output:
        f'{output_dir}/{{sample}}.unicycler'
    shell:
        """
        unicycler blah blah blah
        """

rule trycycler_subsample:
    input:
        'download_complete.temp'
    output:
        f'{output_dir}/{{sample}}.subsamples'
    shell:
        """
        trycycler subsample blah blah blah
        """

rule trycycler_cluster:
    input:
        f'{output_dir}/{{sample}}.subsamples'
    output:
        f'{output_dir}/{{sample}}.trycycler'
    shell:
        """
        trycycler cluster blah blah blah
        """
Ignoring the specifics of the actual shell commands here, this will compile to a DAG that looks something like this (please pardon my text art):
            sample_list.txt
                  |
            download_files
           /      |       \
       flye   unicycler   trycycler_subsample
        |         |              |
        |         |       trycycler_cluster
        |         |              |
         \        |             /
          \       |            /
                 all
Now this is pretty oversimplified; there are a few preprocessing steps, etc. The rightmost chain in the DAG also has steps between trycycler_subsample and trycycler_cluster that perform mini-assemblies on the subsamples, before using trycycler_cluster to make a final assembly. For the purposes of this question, that's irrelevant.
Essentially my question is this: if trycycler_subsample fails due to low read depth, and I have no way of knowing the read depth before the file is downloaded, how can I safely abort the rest of the rightmost chain without affecting/ending the entire pipeline? Snakemake still expects to see the folder f'{output_dir}/{{sample}}.trycycler', but there is no point proceeding through several extra steps in that chain if the very first subsample command exits in failure. I've seen some suggestions of using Python to conditionally modify the targets of rule all, but again, that would have to happen before the workflow is ever executed, and I have no way of assessing depth until after the files are downloaded.
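For reference, the suggestions I've seen look roughly like the following, where depth_lookup and MIN_DEPTH are hypothetical placeholders for information I simply don't have before download_files has run:

# Hypothetical: filter the sample list up front using a known depth table.
# This only works if read depths are already available, which in my case
# they are not until after the data has been downloaded.
deep_enough = [s for s in sample_list if depth_lookup[s] >= MIN_DEPTH]

rule all:
    input:
        expand(f'{output_dir}/{{sample}}.{{assembler}}', sample=sample_list, assembler=['flye', 'unicycler']),
        expand(f'{output_dir}/{{sample}}.trycycler', sample=deep_enough)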
What's the best recourse here?
If you're just starting out with Snakemake, it might be worth switching to Nextflow, which is designed for this kind of problem. Otherwise, I think you have to look at checkpoints.
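Here is a rough sketch of what a checkpoint-based approach could look like; the assess_depth step, the depth-file format, and MIN_DEPTH are all hypothetical placeholders to adapt to whatever depth estimate you actually use. The idea is that Snakemake re-evaluates the DAG after a checkpoint completes, so rule all can ask for the trycycler target only for samples that turn out to have enough depth:

# Sketch only: assess_depth, its output format, and MIN_DEPTH are assumptions.
checkpoint assess_depth:
    input:
        'download_complete.temp'
    output:
        f'{output_dir}/{{sample}}.depth'
    shell:
        """
        some_depth_estimate blah blah blah > {output}
        """

def trycycler_targets(wildcards):
    # Evaluated only after the assess_depth jobs have finished;
    # checkpoints.<name>.get() enforces that dependency.
    targets = []
    for s in sample_list:
        depth_file = checkpoints.assess_depth.get(sample=s).output[0]
        with open(depth_file) as fh:
            depth = float(fh.read().strip())
        if depth >= MIN_DEPTH:
            targets.append(f'{output_dir}/{s}.trycycler')
    return targets

rule all:
    input:
        expand(f'{output_dir}/{{sample}}.{{assembler}}', sample=sample_list, assembler=['flye', 'unicycler']),
        trycycler_targets

With something like this, samples that fall below the threshold simply never get a trycycler_subsample job scheduled, while flye and unicycler still run for every sample and the rest of the pipeline is unaffected.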