how to get around snakemake with different wildcards input to output rule
2
0
Entering edit mode
22 months ago
samuel ▴ 260

I am trying to write a rule that has a different wildcard in the input to the output like so:

  rule MarkDuplicates:
        input:
            L01="result/gatk4/{fc}_L01_{index}_piped.bam",
            L02="result/gatk4/{fc}_L02_{index}_piped.bam",
            L03="result/gatk4/{fc}_L03_{index}_piped.bam",
            L04="result/gatk4/{fc}_L04_{index}_piped.bam"
        output:
            bam="result/gatk4/samplename_index{index}_markedduplicates.bam",
            txt="result/gatk4/samplename_index{index}_markedduplicates_metrics.txt"
        log:
            "result/logs/markduplicates/samplename_index{index}.out"
        benchmark:
           "result/benchmarks/samplename_index{index}.md.out"
        container:
            config["containers"]["gatk4"]
        threads: 8
        shell:
            """
            gatk --java-options '-Xmx30G' MarkDuplicates
            I={params.L01} I={params.L02} I={params.L03} I={params.L04}
            O={output.bam}
            M={output.txt}
            TMP_DIR=`pwd`/tmp
            2>{log}
            """

But I know that you have to have matching wildcards in the input rule to the output rule. I tried adding a touch to the previous rule and adding the real input files to the params: like so:

    configfile: "config.yaml"

    fc = config["flowcell"]
    index = config["index"]
    samplename = config["sn"]
    lane = ["L01","L02","L03","L04"]
    con_ref = config["ref"]

    rule all:
        input:
            expand(["result/gatk4/{fc}_L01_{index}_piped.bam",
            "result/gatk4/{fc}_L02_{index}_piped.bam",
            "result/gatk4/{fc}_L03_{index}_piped.bam",
            "result/gatk4/{fc}_L04_{index}_piped.bam", "result/gatk4/SamToFastq.done",  "result/gatk4/samplename_index{index}_markedduplicates.bam"], fc = fc, lane = lane, index = index)

    input:
        "result/gatk4/{fc}_{lane}_{index}_markadapters.bam",
    output:
        unmapped="result/gatk4/{fc}_{lane}_{index}_fastqtosam.bam",
        out="result/gatk4/{fc}_{lane}_{index}_piped.bam",
        tmp="result/gatk4/SamToFastq.done"
    log:
        "result/logs/samtofastq/{fc}_{lane}_{index}.out"
    benchmark:
       "result/benchmarks/{fc}_{lane}_{index}.sam2fq.out"
    params:
        ref=con_ref
    container:
        config["containers"]["gatk4"]
    threads: 8
    shell:
        """
        gatk --java-options '-Xmx8G' SamToFastq
        I={input}
        FASTQ=/dev/stdout
        CLIPPING_ATTRIBUTE=XT CLIPPING_ACTION=2 INTERLEAVE=true
        bwa mem -M -t {threads} -p {params.ref} /dev/stdin |
        gatk --java-options '-Xmx20G' MergeBamAlignment
        ALIGNED_BAM=/dev/stdin
        UNMAPPED_BAM={output.unmapped}
        OUTPUT={output.out}
        REF={params.ref} CREATE_INDEX=true ADD_MATE_CIGAR=true
        CLIP_ADAPTERS=false CLIP_OVERLAPPING_READS=true
        INCLUDE_SECONDARY_ALIGNMENTS=true MAX_INSERTIONS_OR_DELETIONS=-1
        PRIMARY_ALIGNMENT_STRATEGY=MostDistant ATTRIBUTES_TO_RETAIN=XS
        TMP_DIR=`pwd`/tmp
        2>{log} &&
        touch {output.tmp}
        """

rule MarkDuplicates:
    input:
        "result/gatk4/SamToFastq.done"
    output:
        bam="result/gatk4/samplename_index{index}_markedduplicates.bam",
        txt="result/gatk4/samplename_index{index}_markedduplicates_metrics.txt"
    log:
        "result/logs/markduplicates/samplename_index{index}.out"
    benchmark:
       "result/benchmarks/samplename_index{index}.md.out"
    container:
        config["containers"]["gatk4"]
    params:
        L01="result/gatk4/{fc}_L01_{index}_piped.bam",
        L02="result/gatk4/{fc}_L02_{index}_piped.bam",
        L03="result/gatk4/{fc}_L03_{index}_piped.bam",
        L04="result/gatk4/{fc}_L04_{index}_piped.bam"
    threads: 8
    shell:
        """
        gatk --java-options '-Xmx30G' MarkDuplicates
        I={params.L01} I={params.L02} I={params.L03} I={params.L04}
        O={output.bam}
        M={output.txt}
        TMP_DIR=`pwd`/tmp
        2>{log}
        """

This generates an error:

SyntaxError:
Not all output, log and benchmark files of rule SamToFastq contain the same wildcards. This is crucial though, in order to avoid that two or more jobs write to the same file.

I then tried creating the touch file another way:

    configfile: "config.yaml"

    fc = config["flowcell"]
    index = config["index"]
    samplename = config["sn"]
    lane = ["L01","L02","L03","L04"]
    con_ref = config["ref"]

    rule all:
        input:
            expand(["result/gatk4/{fc}_L01_{index}_piped.bam",
            "result/gatk4/{fc}_L02_{index}_piped.bam",
            "result/gatk4/{fc}_L03_{index}_piped.bam",
            "result/gatk4/{fc}_L04_{index}_piped.bam", "result/gatk4/SamToFastq.done",  "result/gatk4/samplename_index{index}_markedduplicates.bam"], fc = fc, lane = lane, index = index)

    input:
        "result/gatk4/{fc}_{lane}_{index}_markadapters.bam",
    output:
        unmapped="result/gatk4/{fc}_{lane}_{index}_fastqtosam.bam",
        out="result/gatk4/{fc}_{lane}_{index}_piped.bam",
        touch("result/gatk4/SamToFastq.done")
    log:
        "result/logs/samtofastq/{fc}_{lane}_{index}.out"
    benchmark:
       "result/benchmarks/{fc}_{lane}_{index}.sam2fq.out"
    params:
        ref=con_ref
    container:
        config["containers"]["gatk4"]
    threads: 8
    shell:
        """
        gatk --java-options '-Xmx8G' SamToFastq
        I={input}
        FASTQ=/dev/stdout
        CLIPPING_ATTRIBUTE=XT CLIPPING_ACTION=2 INTERLEAVE=true
        bwa mem -M -t {threads} -p {params.ref} /dev/stdin |
        gatk --java-options '-Xmx20G' MergeBamAlignment
        ALIGNED_BAM=/dev/stdin
        UNMAPPED_BAM={output.unmapped}
        OUTPUT={output.out}
        REF={params.ref} CREATE_INDEX=true ADD_MATE_CIGAR=true
        CLIP_ADAPTERS=false CLIP_OVERLAPPING_READS=true
        INCLUDE_SECONDARY_ALIGNMENTS=true MAX_INSERTIONS_OR_DELETIONS=-1
        PRIMARY_ALIGNMENT_STRATEGY=MostDistant ATTRIBUTES_TO_RETAIN=XS
        TMP_DIR=`pwd`/tmp
        2>{log}
        """

rule MarkDuplicates:
    input:
        "result/gatk4/SamToFastq.done"
    output:
        bam="result/gatk4/samplename_index{index}_markedduplicates.bam",
        txt="result/gatk4/samplename_index{index}_markedduplicates_metrics.txt"
    log:
        "result/logs/markduplicates/samplename_index{index}.out"
    benchmark:
       "result/benchmarks/samplename_index{index}.md.out"
    container:
        config["containers"]["gatk4"]
    params:
        L01="result/gatk4/{fc}_L01_{index}_piped.bam",
        L02="result/gatk4/{fc}_L02_{index}_piped.bam",
        L03="result/gatk4/{fc}_L03_{index}_piped.bam",
        L04="result/gatk4/{fc}_L04_{index}_piped.bam"
    threads: 8
    shell:
        """
        gatk --java-options '-Xmx30G' MarkDuplicates
        I={params.L01} I={params.L02} I={params.L03} I={params.L04}
        O={output.bam}
        M={output.txt}
        TMP_DIR=`pwd`/tmp
        2>{log}
        """

But I get the error:

positional argument follows keyword argument

Is there a way of getting around different wildcards for the input to output rules?? If so how would I achieve this??

snakemake • 3.1k views
ADD COMMENT
2
Entering edit mode
22 months ago
hugo.avila ▴ 530

In Snakemake, you can have different wildcards in the input and output of a rule. In your case, you have different wildcard names in your input and output files, which is allowed. However, the error you get is because the input to the rule MarkDuplicates is not defined correctly.

You have specified result/gatk4/SamToFastq.done as the input for the MarkDuplicates rule, but this file is not used as input for the shell command in the rule. Instead, you use the wildcard params.L01, params.L02, params.L03, and params.L04, which are not specified in the input of the rule.

To fix the error, you need to make sure that the input of the rule contains all files that are used as input in the shell command. In this case, you should change the input of the rule to include all input files, like so:

rule MarkDuplicates:
    input:
        L01="result/gatk4/{fc}_L01_{index}_piped.bam",
        L02="result/gatk4/{fc}_L02_{index}_piped.bam",
        L03="result/gatk4/{fc}_L03_{index}_piped.bam",
        L04="result/gatk4/{fc}_L04_{index}_piped.bam",
        samtofastq="result/gatk4/{fc}_{lane}_{index}_piped.bam"
    output:
        bam="result/gatk4/samplename_index{index}_markedduplicates.bam",
        txt="result/gatk4/samplename_index{index}_markedduplicates_metrics.txt"
    log:
        "result/logs/markduplicates/samplename_index{index}.out"
    benchmark:
       "result/benchmarks/samplename_index{index}.md.out"
    container:
        config["containers"]["gatk4"]
    threads: 8
    shell:
        """
        gatk --java-options '-Xmx30G' MarkDuplicates
        I={input.L01} I={input.L02} I={input.L03} I={input.L04} I={input.samtofastq}
        O={output.bam}
        M={output.txt}
        TMP_DIR=`pwd`/tmp
        2>{log}
        """

Here, samtofastq is added as an input file, and is used in the shell command. Note that the input dictionary should have the same keys as the parameter names used in the shell command.

bonus tip: you sometimes passed things that should have been inputs to a rule as parameters, avoid this. When your parameter is an actual a file, pass it as an input instead. This applies not only to input data, but also to index files and scripts. By doing so, when Snakemake builds the DAG, it will treat these files as dependencies and notify you with a clearer error message if they are missing.

ADD COMMENT
1
Entering edit mode
22 months ago
samuel ▴ 260

Hi hugo.avila,

Thanks for your reply.

I actually don't want to use the samtofastq="result/gatk4/{fc}_{lane}_{index}_piped.bam" file. The only reason why I thought I needed this was because when I tried to run the MarkDuplciates rule without it like below I get the error:

WildcardError in line 132 of snakefile: Wildcards in input files cannot be determined from output files: 'fc'
configfile: "config.yaml"

fc = config["flowcell"]
index = config["index"]
samplename = config["sn"]
lane = ["L01","L02","L03","L04"]
con_ref = config["ref"]

rule all:
    input:
        expand(["result/gatk4/{fc}_L01_{index}_piped.bam",
        "result/gatk4/{fc}_L02_{index}_piped.bam",
        "result/gatk4/{fc}_L03_{index}_piped.bam",
        "result/gatk4/{fc}_L04_{index}_piped.bam", "result/gatk4/samplename_index{index}_markedduplicates.bam"], fc = fc, lane = lane, index = index)
rule MarkDuplicates:
    input:
        L01="result/gatk4/{fc}_L01_{index}_piped.bam",
        L02="result/gatk4/{fc}_L02_{index}_piped.bam",
        L03="result/gatk4/{fc}_L03_{index}_piped.bam",
        L04="result/gatk4/{fc}_L04_{index}_piped.bam"
    output:
        bam="result/gatk4/samplename_index{index}_markedduplicates.bam",
        txt="result/gatk4/samplename_index{index}_markedduplicates_metrics.txt"
    log:
        "result/logs/markduplicates/samplename_index{index}.out"
    benchmark:
       "result/benchmarks/samplename_index{index}.md.out"
    container:
        config["containers"]["gatk4"]
    threads: 8
    shell:
        """
        gatk --java-options '-Xmx30G' MarkDuplicates
        I={input.L01} I={input.L02} I={input.L03} I= {input.L04}
        O={output.bam}
        M={output.txt}
        TMP_DIR=`pwd`/tmp
        2>{log}
        """
ADD COMMENT
1
Entering edit mode

Hello @sheryl, sorry my bad.

Humm You rule MarkDuplicates looks fine.

If the config file is not to big can you share it ? I think the problem is with the wildcard notation again.

Also can you share the name of the input files and desired output structure ? I think i know what is happening but i want to be sure.

My gess is that the problem seems to be that the MarkDuplicates rule can't figure it out the name of the outputs bc maybe you have two "{fc}" with the same "{index}".

What you can do is to add the {fc} to the name of the output (if possible):

#!/usr/bin/env python3

configfile: "config.yaml"

fc = config["flowcell"]
index = config["index"]
samplename = config["sn"]
lane = ["L01","L02","L03","L04"]
con_ref = config["ref"]

rule all:
    input:
        markedduplicates = expand(
            "result/gatk4/samplename_{fc}_index{index}_markedduplicates.bam",
            fc = fc,
            lane = lane,
            index = index
        )

rule MarkDuplicates:
    input:
        lanes = expand(
            "result/gatk4/{fc}_{lane}_{index}_piped.bam", ### <- HERE: added fc
            fc = fc,
            lane = lane,
            index = index
        )
    output:
        bam="result/gatk4/samplename_{fc}_index{index}_markedduplicates.bam",  ### <- HERE: added fc
        txt="result/gatk4/samplename_{fc}_index{index}_markedduplicates_metrics.txt"  ### <- HERE: added fc
    log:
        "result/logs/markduplicates/samplename_{fc}_index{index}.out"  ### <- HERE: added fc
    benchmark:
       "result/benchmarks/samplename_{fc}_index{index}.md.out"  ### <- HERE: added fc
    container:
        config["containers"]["gatk4"]
    threads:
        8
    shell:
        """
        gatk --java-options '-Xmx30G' MarkDuplicates
        -I {input.lanes}   ###   <- HERE: i think this sould work, if not, change bact to single calls. I just build the DAG i did not run anything
        -O {output.bam}
        -M {output.txt}
        TMP_DIR=`pwd`/tmp
        2>{log}
        """

I tried here with snakemake 7.8.5 and it worked. But maybe i got your file structure wrong, please do clarify if so.

The resulting DAG looks like this:

# i comment the config.yaml line
$ snakemake -pnc 1 --config flowcell=XXXX index=YYYY sn=test_01 ref=testHG
Building DAG of jobs...
Job stats:
job               count    min threads    max threads
# --------------  -------  -------------  -------------
MarkDuplicates        1              1              1
all                   1              1              1
total                 2              1              1


[Tue Feb 28 11:36:45 2023]
rule MarkDuplicates:
    input: result/gatk4/XXXX_L01_YYYY_piped.bam, result/gatk4/XXXX_L02_YYYY_piped.bam, result/gatk4/XXXX_L03_YYYY_piped.bam, result/gatk4/XXXX_L04_YYYY_piped.bam
    output: result/gatk4/samplename_XXXX_indexYYYY_markedduplicates.bam, result/gatk4/samplename_XXXX_indexYYYY_markedduplicates_metrics.txt
    log: result/logs/markduplicates/samplename_XXXX_indexYYYY.out
    jobid: 1
    benchmark: result/benchmarks/samplename_XXXX_indexYYYY.md.out
    reason: Missing output files: result/gatk4/samplename_XXXX_indexYYYY_markedduplicates.bam
    wildcards: fc=XXXX, index=YYYY
    resources: tmpdir=/tmp


        gatk --java-options '-Xmx30G' MarkDuplicates
        -I result/gatk4/XXXX_L01_YYYY_piped.bam result/gatk4/XXXX_L02_YYYY_piped.bam result/gatk4/XXXX_L03_YYYY_piped.bam result/gatk4/XXXX_L04_YYYY_piped.bam
        -O result/gatk4/samplename_XXXX_indexYYYY_markedduplicates.bam
        -M result/gatk4/samplename_XXXX_indexYYYY_markedduplicates_metrics.txt
        TMP_DIR=`pwd`/tmp
        2>result/logs/markduplicates/samplename_XXXX_indexYYYY.out


[Tue Feb 28 11:36:45 2023]
localrule all:
    input: result/gatk4/samplename_XXXX_indexYYYY_markedduplicates.bam
    jobid: 0
    reason: Input files updated by another job: result/gatk4/samplename_XXXX_indexYYYY_markedduplicates.bam
    resources: tmpdir=/tmp

Job stats:
job               count    min threads    max threads
# --------------  -------  -------------  -------------
MarkDuplicates        1              1              1
all                   1              1              1
total                 2              1              1


This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
ADD REPLY
1
Entering edit mode

Thanks for your answer hugo.avila

As the original question evolved slightly (as I didn't want to use the dummy file result/gatk4/SamToFastq.done in the input rule but thought I had to to get around the problematic {fc}), I posted an updated question here linked question In the end I gave up trying to not include {fc} into the output (re. the linked question as I couldn't quite follow what was going on and I kept getting errors) and this answer works well.

Thanks for you help - much appreciated and lots of learning had

ADD REPLY

Login before adding your answer.

Traffic: 1946 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6