Hello,
I'm trying to build my first Snakemake workflow. I've got four sets of short reads for two samples (left/right reads for samples HB001_naive/coevolved) that I'd like to correct with the script bbduk.sh. I'm trying to write a rule for that:
SAMPLES=["HB001_naive", "HB001_coevolved"]
READ_SETS=["1", "2"]

rule all:
    input:
        expand("1_reads_qc/{sample}/{sample}_{read_set}.bbduk.fq.gz", sample=SAMPLES, read_set=READ_SETS)

rule bbduk:
    input:
        expand("0_reads_raw/{sample}_{read_set}.fq.gz", sample=SAMPLES, read_set=READ_SETS),
    output:
        expand("1_reads_qc/{sample}/{sample}_{read_set}.bbduk.fq.gz", sample=SAMPLES, read_set=READ_SETS),
    conda:
        "envs/bbtools.yaml"
    threads: 12
    shell:
        "bbduk.sh -Xmx60g t={threads} \
        in1={input} in2={input} \
        out1={output} out2={output} \
        ref=/home/scro4331/.conda/envs/bbtools/opt/bbmap-38.18/resources/adapters.fa \
        ktrim=r k=23 mink=11 hdist=2 maq=10 minlen=100 tpe tbo \
        stats=bbduk.contaminants"
This lists all the input files after in1= and in2=, same for output. However, I'd obviously like Snakemake to consider first only the naive read sets (as in1=0_reads_raw/HB001_naive_1.fq.gz and in2=0_reads_raw/HB001_naive_2.fq.gz), then the coevolved ones.
Any suggestions on how to make it work, allowing flexibility in adding more samples? I'd appreciate any help!
Thank you, this makes sense! I actually tried splitting the input, but clearly I did something wrong... There is one thing I still don't understand, however: when I run this without rule all, I get "Target rules may not contain wildcards." On the other hand, the first example in the Snakemake tutorial (https://snakemake.readthedocs.io/en/stable/tutorial/basics.html#step-2-generalizing-the-read-mapping-rule) uses a wildcard before rule all is introduced. Why does my code not work while the example does?

Snakemake starts from the final output files of your (desired) workflow (the files declared as input of the target rule), then works backwards to determine which rules it has to run. Unless a target rule supplies an expansion, Snakemake doesn't know which values to substitute for the wildcards. In the tutorial example, they launched snakemake with the requested output file given explicitly.
So, by writing
Snakemake expands the text between the slash and the comma into a list, sample, following the output pattern; then, for each item in the sample list, it tries to substitute the same value into the input expression. That's why you always need the same matching pattern in output and input, which can easily be propagated through many rules' input/output by always using the same wildcards.

Makes sense, thank you!!
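The matching described above can be sketched in plain Python: a requested file is compared against a rule's output pattern, the wildcard values are read off, and the same values are substituted into the input pattern. This is a simplified illustration under stated assumptions, not Snakemake's implementation; pattern_to_regex and infer_wildcards are hypothetical helpers, and each wildcard is assumed to match any non-empty string.

```python
import re

OUTPUT_PATTERN = "1_reads_qc/{sample}/{sample}_{read_set}.bbduk.fq.gz"
INPUT_PATTERN = "0_reads_raw/{sample}_{read_set}.fq.gz"


def pattern_to_regex(pattern):
    """Hypothetical helper: turn {name} placeholders into named regex groups.
    A repeated wildcard must match the same text each time (backreference)."""
    seen = set()
    parts = []
    pos = 0
    for m in re.finditer(r"\{(\w+)\}", pattern):
        parts.append(re.escape(pattern[pos:m.start()]))
        name = m.group(1)
        if name in seen:
            parts.append(f"(?P={name})")
        else:
            parts.append(f"(?P<{name}>.+)")
            seen.add(name)
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile("".join(parts) + "$")


def infer_wildcards(target, output_pattern):
    """Return the wildcard values if the target fits the pattern, else None."""
    m = pattern_to_regex(output_pattern).match(target)
    return m.groupdict() if m else None


target = "1_reads_qc/HB001_naive/HB001_naive_1.bbduk.fq.gz"
wildcards = infer_wildcards(target, OUTPUT_PATTERN)
# The same values are carried over to the input side of the rule.
needed_input = INPUT_PATTERN.format(**wildcards)
print(wildcards, needed_input)
```

The sketch only works because the input and output patterns share the wildcard names sample and read_set; a name that appears in the input but not in the output could not be filled in from the requested file.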