Merge several input files to single output file snakemake
3
0
Entering edit mode
22 months ago
samuel ▴ 260

Hi, I am a complete newbie at snakemake. I had a similar post here how to get around snakemake with different wildcards input to output rule but I thought I would ask a simplified question.

I have previous rule ( not shown but works well) that generates x4 (bam) files. The next rule takes these four files as inputs and needs to produce one file as output.

I have tried the following:

configfile: "config.yaml"

fc = config["flowcell"]
index = config["index"]
samplename = config["sn"]



rule all:
    input:
        expand(["result/gatk4/{fc}_L01_{index}_piped.bam",
        "result/gatk4/{fc}_L02_{index}_piped.bam",
        "result/gatk4/{fc}_L03_{index}_piped.bam",
        "result/gatk4/{fc}_L04_{index}_piped.bam", "result/gatk4/samplename_index{index}_markedduplicates.bam"], fc = fc,  index = index)
rule MarkDuplicates:
    input:
        L01="result/gatk4/{fc}_L01_{index}_piped.bam",
        L02="result/gatk4/{fc}_L02_{index}_piped.bam",
        L03="result/gatk4/{fc}_L03_{index}_piped.bam",
        L04="result/gatk4/{fc}_L04_{index}_piped.bam"
    output:
        bam="result/gatk4/samplename_index{index}_markedduplicates.bam",
        txt="result/gatk4/samplename_index{index}_markedduplicates_metrics.txt"
    shell:
        """
        gatk --java-options '-Xmx30G' MarkDuplicates
        I={input.L01} I={input.L02} I={input.L03} I= {input.L04}
        O={output.bam}
        M={output.txt}
        TMP_DIR=`pwd`/tmp
        2>{log}
        """

But I get the error:

Wildcards in input files cannot be determined from output files: 'fc'

I have tried looking at similar questions like here enter link description here But as a newbie I can't quite follow the code.

I was wondering if I could make a list and use this as the input to the rule like:

test = ["result/gatk4/V350134767_L01_3_piped.bam",  "result/gatk4/V350134767_L02_3_piped.bam", "result/gatk4/V350134767_L03_3_piped.bam", "result/gatk4/V350134767_L04_3_piped.bam"]

rule MarkDuplicates:
    input:
        test
    output:
        bam="result/gatk4/samplename_index{index}_markedduplicates.bam",
        txt="result/gatk4/samplename_index{index}_markedduplicates_metrics.txt"
    shell:
        """
        gatk --java-options '-Xmx30G' MarkDuplicates
        I={test[0]} I={test[1]} I={test[2]} I={test[3]}
        O={output.bam}
        M={output.txt}
        TMP_DIR=`pwd`/tmp
        2>{log}
        """

Then I could work out how to generate this list without having to hard code the sample names but I don't know if it's possible (and don't know the correct syntax)

Please help.

snakemake • 2.1k views
ADD COMMENT
1
Entering edit mode
22 months ago
raphael.B ▴ 520

Something like this should work. In general it is quite weird to have a wildcard present in the input but not in the output. This occurs only if you are merging on the different values taken by this wildcard through an expand. I think your problem comes from there

rule all:
    input:
        expand( "result/gatk4/samplename_{fc}_index{index}_markedduplicates.bam", fc = fc,  index = index)


rule MarkDuplicates:
    input:
        L01="result/gatk4/{fc}_L01_{index}_piped.bam",
        L02="result/gatk4/{fc}_L02_{index}_piped.bam",
        L03="result/gatk4/{fc}_L03_{index}_piped.bam",
        L04="result/gatk4/{fc}_L04_{index}_piped.bam"
    output:
        bam="result/gatk4/samplename_{fc}_index{index}_markedduplicates.bam",
        txt="result/gatk4/samplename_{fc}_index{index}_markedduplicates_metrics.txt"
    shell:
        """
        gatk --java-options '-Xmx30G' MarkDuplicates
        I={input.L01} I={input.L02} I={input.L03} I= {input.L04}
        O={output.bam}
        M={output.txt}
        TMP_DIR=`pwd`/tmp
        2>{log}
        """

To answer your second question (if I understood it well), a nicer way to do this would be something like this (not tested)

rule MarkDuplicates:
    input:
        expand("result/gatk4/{{fc}}_L{id}_{{index}}_piped.bam", id=['01','02',...])
    output:
        bam="result/gatk4/samplename_{fc}_index{index}_markedduplicates.bam",
        txt="result/gatk4/samplename_{fc}_index{index}_markedduplicates_metrics.txt"
    params:
        fmt=lambda wildcards,input: ' -I='.join(input)
    shell:
        """
        gatk --java-options '-Xmx30G' MarkDuplicates
        I={params.fmt}
        O={output.bam}
        M={output.txt}
        TMP_DIR=`pwd`/tmp
        2>{log}
        """
ADD COMMENT
0
Entering edit mode
22 months ago
sschmeier ▴ 120

Does your outputs not just simply need to include the {fc} variable in the name?

ADD COMMENT
0
Entering edit mode

I didn't want it in the output name.....is there a way to succeed without this?

ADD REPLY
0
Entering edit mode
22 months ago
sschmeier ▴ 120

Input and output must have the same wildcards generally.

In your case you can circumvent using an input function that maps the {index} to the correct input filenames:

def get_inputs(wildcards):
  ind = wildcards.index
  # now you would need to get the correct fc based on ind 
  # e.g. if you have it stored in a dict you could do 
  # could be set after reading the sampletable etc...
  fc = index2fc[ind]
  results= [f"result/gatk4/{fc}_L01_{ind}_piped.bam",
            f"result/gatk4/{fc}_L02_{ind}_piped.bam",
            f"result/gatk4/{fc}_L03_{ind}_piped.bam",
            f"result/gatk4/{fc}_L04_{ind}_piped.bam"]
  return results

Then you could use it as input to MarkDuplicates.

rule MarkDuplicates:
     input:
         get_inputs
    ...
ADD COMMENT

Login before adding your answer.

Traffic: 3160 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6