Question

Snakemake error

1

Entering edit mode

2.7 years ago

kamanovae ▴ 100

Hi! I have a snakemake pipeline with an error that I can't find. I successfully run all the rules in turn, but if I run the pipeline in one click, then the sort rules for some reason are launched repeatedly by strange output folders

Job stats:
job                       count    min threads    max threads
----------------------  -------  -------------  -------------
all                           1              1              1
bedtools_genomecov            1              1              1
bwamem                        2              1              1
gatk_ApplyBQSR               64              1              1
gatk_BaseRecalibrator        64              1              1
gatk_GatherBQSRReports        1              1              1
repair                        2              1              1
samtools_index                1              1              1
sort                          1              1              1
total                       137              1              1

Select jobs to execute...


rule sort:
    output: out///SRR2962669/SRR2962669//SRR2962669/SRR2962669_R1.fastq, out///SRR2962669/SRR2962669//SRR2962669/SRR2962669_R2.fastq
    jobid: 8
    wildcards: sample=/SRR2962669/SRR2962669
    resources: tmpdir=/tmp

This is error:

Error in rule sort:
    jobid: 8
    output: out///SRR2962669/SRR2962669//SRR2962669/SRR2962669_R1.fastq, out///SRR2962669/SRR2962669//SRR2962669/SRR2962669_R2.fastq
    shell:
         cat /storage1/GatkBwaTest/exom//SRR2962669/SRR2962669*1*.fastq | paste - - - - |  sort -k1,1 -S 30G | tr ' ' '
' > out///SRR2962669/SRR2962669//SRR2962669/SRR2962669_R1.fastq ; cat /storage1/GatkBwaTest/exom//SRR2962669/SRR2962669*2*.fastq | paste - - - - |  sort -k1,1 -S 30G | tr '    ' '
' > out///SRR2962669/SRR2962669//SRR2962669/SRR2962669_R2.fastq
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

But if you run one rule sort, then there are no problems

Pipeline looks like this:

import os

(SAMPLES,) = glob_wildcards("/storage1/GatkBwaTest/exom/SRR2962669/{sample}_1.fastq")
INTERVALS = glob_wildcards("/storage1/GatkBwaTest/SnakemakeInput/intervals/{interval}.bed")

SAMPLE_PATH=os.path.dirname(config["sample"])
print(SAMPLE_PATH)
gatk="/storage1/GatkBwaTest/bin/gatk-4.1.3.0/gatk"
bwa_mem="/storage1/GatkBwaTest/bin/bwa-mem2-2.2.1_x64-linux/bwa-mem2"
repair="/storage1/GatkBwaTest/bin/repair.sh"
bcftools="/storage1/GatkBwaTest/bin/bcftools-1.14/bcftools"
filterbyname= "/storage1/GatkBwaTest/bin/bbmap/filterbyname.sh"
samtools="/storage1/GatkBwaTest/bin/samtools-1.11/samtools"



rule all:
        input:
                expand(config["out"] + "/{sample}/{sample}_R1.fastq", sample=SAMPLES),
                expand(config["out"] + "/{sample}/{sample}_R2.fastq", sample=SAMPLES),
        expand(config["out"] + "/{sample}/{sample}_fastrepair_R1.fastq", sample=SAMPLES),
        expand(config["out"] + "/{sample}/{sample}_fastrepair_R2.fastq",  sample=SAMPLES),
        expand(config["out"] + "/{sample}/{sample}_fastrepair_outs.fastq", sample=SAMPLES),
        expand(config["out"] + "/{sample}/{sample}.sorted.mkdup.bam", sample=SAMPLES),
        expand(config["out"] + "/{sample}/{sample}.sorted.mkdup.bam.bai", sample=SAMPLES),
        expand(config["out"] + "/{sample}/{sample}.bedtools_genomecov.genome.covered.bed", sample=SAMPLES),
        expand(config["out_BaseRecalibrator"] + "/{sample}/{sample}.BaseRecalibrator.{interval}.txt", sample=SAMPLES, interval=INTERVALS.interval),
        expand(config["out_BaseRecalibrator"] + "/{sample}/{sample}.GatherBQSRReports.txt",  sample=SAMPLES),
        expand(config["out_ApplyBQSR"] + "/{sample}/{sample}.out_ApplyBQSR.{interval}.bam", sample=SAMPLES, interval=INTERVALS.interval),

rule sort:
        output:
                forward_reads=config["out"] +"/{sample}/{sample}_R1.fastq", reverse_reads=config["out"] +"/{sample}/{sample}_R2.fastq"
        shell:
                " cat {SAMPLE_PATH}/{wildcards.sample}*1*.fastq | paste - - - - |  sort -k1,1 -S 30G | tr '\t' '\n' > {output.forward_reads} ;"
                " cat {SAMPLE_PATH}/{wildcards.sample}*2*.fastq | paste - - - - |  sort -k1,1 -S 30G | tr '\t' '\n' > {output.reverse_reads}"
rule repair:
    input:
        forward_reads=rules.sort.output.forward_reads, reverse_reads=rules.sort.output.reverse_reads
    output:
        forward_fastrepair=config["out"] + "/{sample}/{sample}_fastrepair_R1.fastq",
        reverse_fastrepair=config["out"] + "/{sample}/{sample}_fastrepair_R2.fastq", 
        outs=config["out"] + "/{sample}/{sample}_fastrepair_outs.fastq"
    log:
        fastrepair=config["out_logs"]+"/{sample}/{sample}.fastrepair.log"
    shell:
        " {repair} -Xmx30g in={input.forward_reads} in2={input.reverse_reads} out={output.forward_fastrepair} out2={output.reverse_fastrepair} "
        " outs={output.outs} sorted=true "      
rule bwamem:
    input:
        ref=config["ref"], forward_reads=rules.repair.output.forward_fastrepair,
        reverse_reads=rules.repair.output.reverse_fastrepair
    output:
        bam=config["out"] + "/{sample}/{sample}.sorted.mkdup.bam"
    params:
        bwa_threads=config["bwamem_threads"], fixmate_threads=config["fixmate_threads"], sort_threads=["sort_threads"],
        per_thread_sort_mem=config["per_thread_sort_mem"], markdup_threads=config["markdup_threads"]
    log:
        bwa=config["out_logs"]+"/{sample}/{sample}.bwa.log", fixmate=config["out_logs"]+"/{sample}/{sample}.fixmate.log",
        sort=config["out_logs"]+"/{sample}/{sample}.sort.log", markdup=config["out_logs"]+"/{sample}/{sample}.markdup.log"
    shell:
        " {bwa_mem} mem -t {params.bwa_threads} {input.ref} {input.forward_reads} {input.reverse_reads} "
        " -R \'@RG\\tID:{wildcards.sample}\\tPU:unit1\\tSM:{wildcards.sample}\\tPL:Illumina\\tLB:lib1\' 2>{log.bwa} | "
        " {samtools} fixmate  -m - -  2>{log.fixmate} | "
        " {samtools} sort -T 'sort_temorary' -@ {params.sort_threads}  2>{log.sort} | "
        " {samtools} markdup -@ {params.markdup_threads} - {output.bam} 2>{log.markdup} "

rule samtools_index:
    input:
        bam=config["out"] + "/{sample}/{sample}.sorted.mkdup.bam"
    output:
        index=config["out"] + "/{sample}/{sample}.sorted.mkdup.bam.bai"         
    log:
        index=config["out_logs"]+"/{sample}/{sample}.samtools_index.log"
    shell:
        " {samtools} index {input.bam} {output.index}"

rule bedtools_genomecov:
    input:
        bam=rules.bwamem.output.bam
    output:
        bed=config["out"]+"{sample}.bedtools_genomecov.genome.covered.bed"
    shell:
        "bedtools genomecov -ibam  {input.bam} -bg > {output.bed}"

rule gatk_BaseRecalibrator:
    input:
        interval=config["intervals"]+"{interval}.bed", bam=rules.bwamem.output.bam,
        ref=config["ref"], dbsnp=config["dbsnp"]
    output:
        txt=config["out_BaseRecalibrator"]+"/{sample}/{sample}.BaseRecalibrator.{interval}.txt"
    params:
        gatk_memory=config["gatk_memory"]
    shell:
        "{gatk} --java-options {params.gatk_memory}  BaseRecalibrator -L {input.interval} -I {input.bam} -O {output.txt} -R {input.ref} "
        " --known-sites {input.dbsnp} "

rule gatk_GatherBQSRReports:
        input:
                txt=expand(config["out_BaseRecalibrator"]+"/{sample}/{sample}.BaseRecalibrator.{interval}.txt", interval=INTERVALS.interval, allow_missing=True)
        params:
                gatk_memory=config["gatk_memory"]
        log:
                gatherBQSRReports=config["out_logs"]+"/{sample}/{sample}.gatherBQSRReports.log"
        output:
                all_txt=config["out_BaseRecalibrator"]+"/{sample}/{sample}.GatherBQSRReports.txt"
        run:
                params_data = '-I ' + ' -I '.join(input.txt)
                cmd = "{gatk} --java-options {params.gatk_memory} GatherBQSRReports %s -O {output.all_txt} 2> {log.gatherBQSRReports}" % params_data
                shell(cmd)

rule gatk_ApplyBQSR:
    input:
        intervals=config["intervals"]+"{interval}.bed", ref=config["ref"],
        txt=rules.gatk_GatherBQSRReports.output.all_txt, bam=rules.bwamem.output.bam
    output:
        ApplyBQSR_bam=config["out_ApplyBQSR"]+"/{sample}/{sample}.out_ApplyBQSR.{interval}.bam"
    params:
        gatk_memory=config["gatk_memory"]
    shell:
        " {gatk} --java-options {params.gatk_memory}  ApplyBQSR -L {input.intervals} -I {input.bam} -O {output.ApplyBQSR_bam} "
        " -R {input.ref} -bqsr {input.txt} --create-output-bam-index false "

snakemake bwa GATK • 1.1k views

ADD COMMENT • link updated 2.7 years ago by JamesBioInf ▴ 10 • written 2.7 years ago by kamanovae ▴ 100

1

Entering edit mode

My guess is that this line might the problem: $ cat /storage1/GatkBwaTest/exom//SRR2962669/SRR2962669*1*.fastq esp this: SRR2962669/SRR2962669*1*.fastq. Create a rule to run only this line/ function and check if this is the offending line/function. There is also a way to tell snakemake not to run bash in strict mode.

What is the purpose of this line:

 cat {SAMPLE_PATH}/{wildcards.sample}*1*.fastq | paste - - - - |  sort -k1,1 -S 30G | tr '\t' '\n' > {output.forward_reads} ;

ADD REPLY • link 2.7 years ago by cpad0112 21k

0

Entering edit mode

what is the value of {SAMPLE_PATH}?

ADD REPLY • link 2.7 years ago by raphael.B ▴ 520

score 0 · Answer 1 · 2022-03-01

Hi :)

Perhaps you could try adding an input to the sort rule to have more control over what is going in (and therefore coming out)?

I'm not totally sure from the example but I feel it may be due to some unintended wildcard expansion or glob_wildcards picking up things you did not mean to. If you try printing SAMPLES it may give some clues. I hope this helps!

Something like:

rule sort:
    input:
            fwd_reads =  os.path.join(SAMPLE_PATH, "{sample}*1*.fastq"),
            rev_reads = os.path.join(SAMPLE_PATH, "{sample}*2*.fastq")
    output:
            forward_reads=config["out"] +"/{sample}/{sample}_R1.fastq", reverse_reads=config["out"] +"/{sample}/{sample}_R2.fastq"
    shell:
            " cat {input.fwd_reads} | paste - - - - |  sort -k1,1 -S 30G | tr '\t' '\n' > {output.forward_reads} ;"
            " cat {input.rev_reads} | paste - - - - |  sort -k1,1 -S 30G | tr '\t' '\n' > {output.reverse_reads}"