I'm including a snipped of my code.
The following block generates two groups, I'm setting the the number of samples to 1 for easier debugging. The samples in groupOne
would be [FOO_A]
and samples in groupTwo
would be [BAR_A]
num_of_samples = 1
groupOne = ["FOO_" + chr(alpha+97).capitalize() for alpha in range(0,num_of_samples)]
groupTwo = ["BAR_" + chr(alpha+97).capitalize() for alpha in range(0,num_of_samples)]
Below is my first rule (rule all
) and two rules generating my reads, and two rules generating bowtie2
alignment (the index is prebuilt and not included here)
rule all:
input:
expand("aligned/ref_{chrom}.{gTwoSam}.bam", chrom=chromosomeNames, gTwoSam = groupTwo),
expand("aligned/ref_{chrom}.{gOneSam}.bam", chrom=chromosomeNames, gOneSam = groupOne),
rule artillumina_simulate_ref_group1:
input:
"reference/{chrom}.fasta"
output:
"simulated/{chrom}.{gOneSam}.1.fq",
"simulated/{chrom}.{gOneSam}.2.fq"
params:
name = "simulated/{chrom}.{gOneSam}."
shell:
"""
art_illumina -ss HS20 -f 9 -l 100 -p -m 360 -s 80 -na -i {input} -o {params.name}
"""
rule artillumina_simulate_ref_group2:
input:
"reference/{chrom}.fasta"
output:
"simulated/ref_{chrom}.{gTwoSam}.1.fq",
"simulated/ref_{chrom}.{gTwoSam}.2.fq"
params:
name = "simulated/ref_{chrom}.{gTwoSam}."
shell:
"""
art_illumina -ss HS20 -f 9 -l 100 -p -m 360 -s 80 -na -i {input} -o {params.name}
"""
rule botwie2_align_ref_group1:
input:
"simulated/ref_{chrom}.{gOneSam}.1.fq",
"simulated/ref_{chrom}.{gOneSam}.2.fq",
"reference/{chrom}.rev.1.bt2", "reference/{chrom}.rev.2.bt2",
"reference/{chrom}.1.bt2", "reference/{chrom}.2.bt2",
"reference/{chrom}.3.bt2", "reference/{chrom}.4.bt2"
params:
btindex = "reference/{chrom}"
output:
"aligned/ref_{chrom}.{gOneSam}.bam"
shell:
"""
bowtie2 -x {params.btindex} -1 {input[0]} -2 {input[1]} | samtools view -bS > {output}
"""
rule botwie2_align_ref_group2:
input:
"simulated/ref_{chrom}.{gTwoSam}.1.fq",
"simulated/ref_{chrom}.{gTwoSam}.2.fq",
"reference/{chrom}.rev.1.bt2", "reference/{chrom}.rev.2.bt2",
"reference/{chrom}.1.bt2", "reference/{chrom}.2.bt2",
"reference/{chrom}.3.bt2", "reference/{chrom}.4.bt2"
params:
btindex = "reference/{chrom}"
output:
"aligned/ref_{chrom}.{gTwoSam}.bam"
shell:
"""
bowtie2 -x {params.btindex} -1 {input[0]} -2 {input[1]} | samtools view -bS > {output}
"""
Running the snakemake
gives me the following error:
Rules botwie2_align_ref_group1 and botwie2_align_ref_group2 are ambiguous for the file aligned/ref_CM008309.1.BAR_A.bam.
Consider starting rule output with a unique prefix, constrain your wildcards, or use the ruleorder directive.
Wildcards:
botwie2_align_ref_group1: chrom=CM008309.1,gOneSam=BAR_A
botwie2_align_ref_group2: chrom=CM008309.1,gTwoSam=BAR_A
So it would seem that my wildcard gTwoSam
overwrites the gOneSam
one.
However, modifying the rule all
to collect output of artillumina_simulate_ref_group
like this:
rule all:
input:
expand("simulated/ref_{chrom}.{gOneSam}.1.fq",chrom=chromosomeNames,gOneSam = groupOne),
expand("simulated/ref_{chrom}.{gOneSam}.2.fq",chrom=chromosomeNames,gOneSam = groupOne),
expand("simulated/ref_{chrom}.{gTwoSam}.1.fq",chrom=chromosomeNames,gTwoSam = groupTwo),
expand("simulated/ref_{chrom}.{gTwoSam}.2.fq",chrom=chromosomeNames,gTwoSam = groupTwo),
And running it dry mode everything works.
So why are my wildcards overwriting one another.
All I can say is, you're trying to confuse snakemake as well as humans. Your rules are almost identical; the only difference is, FASTQ-output rules don't have input files mapped to ambiguous output wildcards while the BAM-output rules do. This is why snakemake is able to resolve the former and not the latter.
Do yourself a favor and remove the duplicates. Whatever you're trying to do with this approach can be done using a much better much less ambiguous approach.
Sure enough, when I remove rule
bowtie2_align_ref_group2
everything works well enough - rule also generates samples specified ingroupTwo
. I've extended the concept to theartillumina_simulate__ref_group
. But my question is why does this happen. I was under the impression that specifying wildcards upstream in therule all
constrains samples - e.g.gOneSam
can only haveFOO
values, andgTwoSam
can only haveBAR
values.The outputs of your simulation rules are much the same regarding snakemake. It can deduce from your rule all that first wildcards is indeed a chromosome but the second one can either be
{gOneSam}
or{gTwoSam}
. Indeed, a wildcard is not a global value, it is purely relative to your rule. Therefore it is not because 2 wildcards have the same name in two rules that they will share the same value. See them as unknown values that will be deduced from the outputs of a downstream rule (here your rule all).If you really do want to run it this way, I would say replacing these two wildcards by
BAR_{N}
andFOO_{N}
withN=[chr(alpha+97).capitalize() for alpha in range(0,num_of_samples)]
should constrain the value of your wildcards.Just as a general remark, since your rules simulate and align are doing the exact same thing for your 2 groups, it would be easier to have only one rule by step, with a wildcard
{group}
withgroup=groupOne+groupTwo
.