Hi! I have a snakemake pipeline with an error that I can't find. I successfully run all the rules in turn, but if I run the pipeline in one click, then the sort rules for some reason are launched repeatedly by strange output folders
Job stats:
job count min threads max threads
---------------------- ------- ------------- -------------
all 1 1 1
bedtools_genomecov 1 1 1
bwamem 2 1 1
gatk_ApplyBQSR 64 1 1
gatk_BaseRecalibrator 64 1 1
gatk_GatherBQSRReports 1 1 1
repair 2 1 1
samtools_index 1 1 1
sort 1 1 1
total 137 1 1
Select jobs to execute...
rule sort:
output: out///SRR2962669/SRR2962669//SRR2962669/SRR2962669_R1.fastq, out///SRR2962669/SRR2962669//SRR2962669/SRR2962669_R2.fastq
jobid: 8
wildcards: sample=/SRR2962669/SRR2962669
resources: tmpdir=/tmp
This is error:
Error in rule sort:
jobid: 8
output: out///SRR2962669/SRR2962669//SRR2962669/SRR2962669_R1.fastq, out///SRR2962669/SRR2962669//SRR2962669/SRR2962669_R2.fastq
shell:
cat /storage1/GatkBwaTest/exom//SRR2962669/SRR2962669*1*.fastq | paste - - - - | sort -k1,1 -S 30G | tr ' ' '
' > out///SRR2962669/SRR2962669//SRR2962669/SRR2962669_R1.fastq ; cat /storage1/GatkBwaTest/exom//SRR2962669/SRR2962669*2*.fastq | paste - - - - | sort -k1,1 -S 30G | tr ' ' '
' > out///SRR2962669/SRR2962669//SRR2962669/SRR2962669_R2.fastq
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
But if you run one rule sort, then there are no problems
Pipeline looks like this:
import os
(SAMPLES,) = glob_wildcards("/storage1/GatkBwaTest/exom/SRR2962669/{sample}_1.fastq")
INTERVALS = glob_wildcards("/storage1/GatkBwaTest/SnakemakeInput/intervals/{interval}.bed")
SAMPLE_PATH=os.path.dirname(config["sample"])
print(SAMPLE_PATH)
gatk="/storage1/GatkBwaTest/bin/gatk-4.1.3.0/gatk"
bwa_mem="/storage1/GatkBwaTest/bin/bwa-mem2-2.2.1_x64-linux/bwa-mem2"
repair="/storage1/GatkBwaTest/bin/repair.sh"
bcftools="/storage1/GatkBwaTest/bin/bcftools-1.14/bcftools"
filterbyname= "/storage1/GatkBwaTest/bin/bbmap/filterbyname.sh"
samtools="/storage1/GatkBwaTest/bin/samtools-1.11/samtools"
rule all:
input:
expand(config["out"] + "/{sample}/{sample}_R1.fastq", sample=SAMPLES),
expand(config["out"] + "/{sample}/{sample}_R2.fastq", sample=SAMPLES),
expand(config["out"] + "/{sample}/{sample}_fastrepair_R1.fastq", sample=SAMPLES),
expand(config["out"] + "/{sample}/{sample}_fastrepair_R2.fastq", sample=SAMPLES),
expand(config["out"] + "/{sample}/{sample}_fastrepair_outs.fastq", sample=SAMPLES),
expand(config["out"] + "/{sample}/{sample}.sorted.mkdup.bam", sample=SAMPLES),
expand(config["out"] + "/{sample}/{sample}.sorted.mkdup.bam.bai", sample=SAMPLES),
expand(config["out"] + "/{sample}/{sample}.bedtools_genomecov.genome.covered.bed", sample=SAMPLES),
expand(config["out_BaseRecalibrator"] + "/{sample}/{sample}.BaseRecalibrator.{interval}.txt", sample=SAMPLES, interval=INTERVALS.interval),
expand(config["out_BaseRecalibrator"] + "/{sample}/{sample}.GatherBQSRReports.txt", sample=SAMPLES),
expand(config["out_ApplyBQSR"] + "/{sample}/{sample}.out_ApplyBQSR.{interval}.bam", sample=SAMPLES, interval=INTERVALS.interval),
rule sort:
output:
forward_reads=config["out"] +"/{sample}/{sample}_R1.fastq", reverse_reads=config["out"] +"/{sample}/{sample}_R2.fastq"
shell:
" cat {SAMPLE_PATH}/{wildcards.sample}*1*.fastq | paste - - - - | sort -k1,1 -S 30G | tr '\t' '\n' > {output.forward_reads} ;"
" cat {SAMPLE_PATH}/{wildcards.sample}*2*.fastq | paste - - - - | sort -k1,1 -S 30G | tr '\t' '\n' > {output.reverse_reads}"
rule repair:
input:
forward_reads=rules.sort.output.forward_reads, reverse_reads=rules.sort.output.reverse_reads
output:
forward_fastrepair=config["out"] + "/{sample}/{sample}_fastrepair_R1.fastq",
reverse_fastrepair=config["out"] + "/{sample}/{sample}_fastrepair_R2.fastq",
outs=config["out"] + "/{sample}/{sample}_fastrepair_outs.fastq"
log:
fastrepair=config["out_logs"]+"/{sample}/{sample}.fastrepair.log"
shell:
" {repair} -Xmx30g in={input.forward_reads} in2={input.reverse_reads} out={output.forward_fastrepair} out2={output.reverse_fastrepair} "
" outs={output.outs} sorted=true "
rule bwamem:
input:
ref=config["ref"], forward_reads=rules.repair.output.forward_fastrepair,
reverse_reads=rules.repair.output.reverse_fastrepair
output:
bam=config["out"] + "/{sample}/{sample}.sorted.mkdup.bam"
params:
bwa_threads=config["bwamem_threads"], fixmate_threads=config["fixmate_threads"], sort_threads=["sort_threads"],
per_thread_sort_mem=config["per_thread_sort_mem"], markdup_threads=config["markdup_threads"]
log:
bwa=config["out_logs"]+"/{sample}/{sample}.bwa.log", fixmate=config["out_logs"]+"/{sample}/{sample}.fixmate.log",
sort=config["out_logs"]+"/{sample}/{sample}.sort.log", markdup=config["out_logs"]+"/{sample}/{sample}.markdup.log"
shell:
" {bwa_mem} mem -t {params.bwa_threads} {input.ref} {input.forward_reads} {input.reverse_reads} "
" -R \'@RG\\tID:{wildcards.sample}\\tPU:unit1\\tSM:{wildcards.sample}\\tPL:Illumina\\tLB:lib1\' 2>{log.bwa} | "
" {samtools} fixmate -m - - 2>{log.fixmate} | "
" {samtools} sort -T 'sort_temorary' -@ {params.sort_threads} 2>{log.sort} | "
" {samtools} markdup -@ {params.markdup_threads} - {output.bam} 2>{log.markdup} "
rule samtools_index:
input:
bam=config["out"] + "/{sample}/{sample}.sorted.mkdup.bam"
output:
index=config["out"] + "/{sample}/{sample}.sorted.mkdup.bam.bai"
log:
index=config["out_logs"]+"/{sample}/{sample}.samtools_index.log"
shell:
" {samtools} index {input.bam} {output.index}"
rule bedtools_genomecov:
input:
bam=rules.bwamem.output.bam
output:
bed=config["out"]+"{sample}.bedtools_genomecov.genome.covered.bed"
shell:
"bedtools genomecov -ibam {input.bam} -bg > {output.bed}"
rule gatk_BaseRecalibrator:
input:
interval=config["intervals"]+"{interval}.bed", bam=rules.bwamem.output.bam,
ref=config["ref"], dbsnp=config["dbsnp"]
output:
txt=config["out_BaseRecalibrator"]+"/{sample}/{sample}.BaseRecalibrator.{interval}.txt"
params:
gatk_memory=config["gatk_memory"]
shell:
"{gatk} --java-options {params.gatk_memory} BaseRecalibrator -L {input.interval} -I {input.bam} -O {output.txt} -R {input.ref} "
" --known-sites {input.dbsnp} "
rule gatk_GatherBQSRReports:
input:
txt=expand(config["out_BaseRecalibrator"]+"/{sample}/{sample}.BaseRecalibrator.{interval}.txt", interval=INTERVALS.interval, allow_missing=True)
params:
gatk_memory=config["gatk_memory"]
log:
gatherBQSRReports=config["out_logs"]+"/{sample}/{sample}.gatherBQSRReports.log"
output:
all_txt=config["out_BaseRecalibrator"]+"/{sample}/{sample}.GatherBQSRReports.txt"
run:
params_data = '-I ' + ' -I '.join(input.txt)
cmd = "{gatk} --java-options {params.gatk_memory} GatherBQSRReports %s -O {output.all_txt} 2> {log.gatherBQSRReports}" % params_data
shell(cmd)
rule gatk_ApplyBQSR:
input:
intervals=config["intervals"]+"{interval}.bed", ref=config["ref"],
txt=rules.gatk_GatherBQSRReports.output.all_txt, bam=rules.bwamem.output.bam
output:
ApplyBQSR_bam=config["out_ApplyBQSR"]+"/{sample}/{sample}.out_ApplyBQSR.{interval}.bam"
params:
gatk_memory=config["gatk_memory"]
shell:
" {gatk} --java-options {params.gatk_memory} ApplyBQSR -L {input.intervals} -I {input.bam} -O {output.ApplyBQSR_bam} "
" -R {input.ref} -bqsr {input.txt} --create-output-bam-index false "
My guess is that this line might the problem:
$ cat /storage1/GatkBwaTest/exom//SRR2962669/SRR2962669*1*.fastq
esp this:SRR2962669/SRR2962669*1*.fastq
. Create a rule to run only this line/ function and check if this is the offending line/function. There is also a way to tell snakemake not to run bash in strict mode.What is the purpose of this line:
what is the value of
{SAMPLE_PATH}
?