After way too much testing, I designed a pretty inelegant solution that I needed to be able to incorporate a PON into my Mutect2 snakemake pipeline.
The steps are detailed here:
- run Mutect2 on all normals
- make an .args file that contains the names of all the vcfs you generated in step 1
- use that file to create the final panel
Easy enough, right? My problem was with step 2. Because you can only run step 3 by either giving the .args file OR by giving each file separate but preceded by the -vcf
option argument, I thought it would be easier to make the args file instead of figuring out how to make snakemake gather and then split out input arguments.
My solution:
rule run_mutect:
input:
"input/{sample}.txt"
output:
"mutect/{sample}.txt",
shell:
# simulate some output value
"echo {input} > {output}"
rule make_panel:
input:
expand("mutect/{normal}.txt", normal = samples["fruits"])
output:
"pon.args"
params:
sed_regex = "s/ /\\n/g"
run:
with open("pon.args", 'a') as f:
for item in {input}:
f.write("%s\n" % item)
shell("sed -i {params.sed_regex:q} pon.args")
If you want to test out a minimal working example, put the above code into a file called pon.smk
.
Then make a few other files (so it's technically not minimal but I did this to mimic the structure of my pipeline so that I wouldn't miss something obvious related to the design of the entire pipeline):
common.smk
:
import pandas as pd
samples = pd.read_table("samples.txt", names = ["fruits"], header = None)
def output_rules():
output1 = expand("mutect/{sample}.txt", sample = samples["fruits"])
output2 = "pon.args"
return output1, output2
Snakefile
:
include: "common.smk"
include: "pon.smk"
# a target rule to define the desired final output
rule all:
input:
output_rules()
and finally a mini-samples file:
echo -e "kiwi\npotato\nstrawberry" >> samples.txt
What I don't like about my solution:
- hard to read
- doesn't actually work the way I expected it to
I don't understand why the expanded input I gave is not written to the file line by line, so I have to go in after with a sed
command to fix that. I tried to follow examples that I found online and had a solution with a checkpoint at one point but I couldn't get it quite right. This is the only thing I got working but I was wondering if others have better solutions.
While I think checkpoint is an elegant solution to many conditional executions, I don't think checkpoint is needed here. Besides, checkpoints can quickly become difficult to read and debug. I think your approach using an expand is great.
Does the following make it easier to read and work the way you expect? It's basically the same approach except I moved samples to configfile.
samples.yaml:
Thank you so much for your input. My main misunderstanding came from how Snakemake makes an expanded list of samples-- I thought it was processed as a list. Your code works well for me.