Question

Snakemake how to use glob_wilcards properly

1

Entering edit mode

3.7 years ago

Peter Chung ▴ 210

I have many paired fastq files and I have a problem on after running trim_galore package, as it named the fastq files with _1_val_1 and _2_val_2, for example: AD50_CTGATCGTA_1_val_1.fq.gz and AD50_CTGATCGTA_2_val_2.fq.gz.

I would like continue snakemake and use

import os
import snakemake.io
import glob

DIR="AD50"
(SAMPLES,READS,) = glob_wildcards(DIR+"{sample}_{read}.fq.gz")
READS=["1","2"]

rule all:
    input:
        expand(DIR+"{sample}_dedup_{read}.fq.gz",sample=SAMPLES,read=READS)

rule clumpify:
    input:
        r1=DIR+"{sample}_1_val_1.fq.gz",
        r2=DIR+"{sample}_2_val_2.fq.gz"
    output:
        r1out=DIR+"{sample}_dedup_1.fq.gz",
        r2out=DIR+"{sample}_dedup_2.fq.gz"
    shell:
        "clumpify.sh in={input.r1} in2={input.r2} out={output.r1out} out2={output.r2out} dedupe subs=0"

and the error is:

Building DAG of jobs...
MissingInputException in line 13 of /home/peterchung/Desktop/Rerun-Result/clumpify.smk:
Missing input files for rule clumpify:
AD50/AD50_CTGATCGTA_2_val_2_val_2.fq.gz
AD50/AD50_CTGATCGTA_2_val_1_val_1.fq.gz

I tired another way, somehow the closest is that it detected the missing input like AD50_CTGATCGTA_1_val_2.fq.gz and AD50_CTGATCGTA_2_val_1.fq.gz which is not exist.

I am not sure the glob_wildcards function I used properly since there are many underscore in it. I tired:

 glob_wildcards(DIR+"{sample}_{read}_val_{read}.fq.gz")

but it did not work as well.

glob wildcards snakemake python • 3.3k views

ADD COMMENT • link 3.7 years ago by Peter Chung ▴ 210

score 0 · Answer 1 · 2021-03-31

If it's not too late and if feasible for you, I would run trim_galore within your snakefile. That is, add a rule that takes in input the raw fastq files and produces the trimmed *_val_* files. Maybe something like:

rule trim_galore:
    input:
        r1= DIR+"{sample}_1.fq.gz",
        r2= DIR+"{sample}_2.fq.gz",
    output:
        r1=DIR+"{sample}_1_val_1.fq.gz",
        r2=DIR+"{sample}_2_val_2.fq.gz",
    shell:
        r"""
        trim_galore ... {input} {output} 
        """

If you have trimmed files and you don't want to do the trimming again, make sure they are in a directory consistent with the code in the snakefile. Snakemake complains that it cannot find AD50/AD50_CTGATCGTA_2_val_2_val_2.fq.gz. So, where do you have the trimmed files and how are they named? Showing the command line you use to run snakemake would also help.