Question

Reading in input file into a Snakemake pipeline

1

Entering edit mode

3.0 years ago

Timotheus ▴ 40

Hello,

I'm trying to build my first Snakemake worklflow. I've got four sets of short reads for two samples (left/right reads for samples HB001_naive/coevolved) that I'd like to correct with a script bbduk.sh. I'm trying to write a rule for that:

SAMPLES=["HB001_naive", "HB001_coevolved"]
READ_SETS=["1", "2"]

rule all:
    input:
        expand("1_reads_qc/{sample}/{sample}_{read_set}.bbduk.fq.gz", sample=SAMPLES, read_set=READ_SETS)

rule bbduk:
    input: 
        expand("0_reads_raw/{sample}_{read_set}.fq.gz", sample=SAMPLES, read_set=READ_SETS),
    output:
        expand("1_reads_qc/{sample}/{sample}_{read_set}.bbduk.fq.gz", sample=SAMPLES, read_set=READ_SETS),
    conda:
        "envs/bbtools.yaml"
    threads: 12
    shell:
        "bbduk.sh -Xmx60g t={threads} \
            in1={input} in2={input} \
            out1={output} out2={output} \
            ref=/home/scro4331/.conda/envs/bbtools/opt/bbmap-38.18/resources/adapters.fa \
            ktrim=r k=23 mink=11 hdist=2 maq=10 minlen=100 tpe tbo \
            stats=bbduk.contaminants"

This lists all the input files after in1= and in2=, same for output. However, I'd obviously like Snakemake to consider first only the naive read sets (as in1=0_reads_raw/HB001_naive_1.fq.gz and in2=in1=0_reads_raw/HB001_naive_2.fq.gz), then the coevolved ones.

Any suggestions on how to make it work, allowing flexibility in adding more samples? I'd appreciate any help!

snakemake • 1.6k views

ADD COMMENT • link 3.0 years ago by Timotheus ▴ 40

score 3 · Accepted Answer · 2022-08-07

3

Entering edit mode

3.0 years ago

Shred ★ 1.6k

By doing

expand("0_reads_raw/{sample}_{read_set}.fq.gz", sample=SAMPLES, read_set=READ_SETS)

You're passing at the same time every permutation of $sample and $read_set , creating a list for the variable input. Doing that inside your rule you're passing this file list two times, for both in1 and in2.

Better splitting input/output into the two pair

rule bbduk:
    input: 
        i1 = "0_reads_raw/{sample}_1.fq.gz",
        i2 = "0_reads_raw/{sample}_2.fq.gz"
    output:
        i1 = "1_reads_qc/{sample}/{sample}_1.bbduk.fq.gz",
        i2 = "1_reads_qc/{sample}/{sample}_2.bbduk.fq.gz"
    shell:
       """
       bbduk.sh -Xmx60g t={threads} \
            in1={input.i1} in2={input.i2} \
            out1={output.i1} out2={output.i2} \
            ref=/home/scro4331/.conda/envs/bbtools/opt/bbmap-38.18/resources/adapters.fa \
            ktrim=r k=23 mink=11 hdist=2 maq=10 minlen=100 tpe tbo \
            stats=bbduk.contaminants
        """

ADD COMMENT • link 3.0 years ago by Shred ★ 1.6k

0

Entering edit mode

Thank you, this makes sense! I actually tried splitting the input, but clearly I did something wrong... There is one thing I still don't understand, however: when I run this without rule all I get Target rules may not contain wildcards. On the other hand, the first example in the Snakemake tutorial (https://snakemake.readthedocs.io/en/stable/tutorial/basics.html#step-2-generalizing-the-read-mapping-rule) uses a wildcard before rule all is introduced. Why does my code not work while the example does??

ADD REPLY • link 3.0 years ago by Timotheus ▴ 40

1

Entering edit mode

Snakemake starts from the final output file of your (desired) workflow (which matches the declared target file input), then goes backwards to check which rule it has to launch. Without passing an expansion into a target rule, Snakemake doesn't know how to expand wildcards. In the example posted, they launched snakemake passing the explicit output file requested

snakemake -np mapped_reads/A.bam mapped_reads/B.bam

So, by writing

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"

Snakemake expands text between the slash and the comma into a list, sample, following the output pattern; then it tries to embed, for each item into the sample list, the same value inside the input expression. That's why you need to always have the same matching pattern in output/input, which could be easly propagated into many rules' input/output using always same wildcards.