Hello, I was struggling for a while about the execution of Snakefile rules. I have a scenario that some rules must be completed before the next one starts because of the dependency, especially when parallel processes are used with the -j option. I met this situation when the sample sizes are very different causing the completion of each step varied too much, some may be 1 hour difference. And this caused the execution of the rules snaked badly and caused error because the execution time cost are so different!
My question is: How to ensure Snakefile rules execution to complete one after another?
Following is a fragment of my Snakefile, and some details may be missing when I copy-n-paste.
SAMPLES, = glob_wildcards("raw/{sample}_R1.fq.gz")
READS=["R1", "R2"]
rule all:
input: expand("qc_raw/{sample}_{read}_fastqc.html", sample=SAMPLES, read=READS),
expand("seqs_trimmed/{sample}_trimmed_PE_{read}.fq.gz", sample=SAMPLES, read=READS),
expand("qc_trimmed/{sample}_trimmed_PE_{read}_fastqc.html", sample=SAMPLES, read=READS),
expand("alignment/{sample}.bam", sample=SAMPLES),
expand("snps_out/SNPs_candidates.g.vcf"),
expand("snps_out/combined_gvcf")
rule fastqc_raw:
input: expand("seqs_raw/{sample}_{read}.fq.gz", sample=SAMPLES, read=READS)
output: expand("qc_raw/{sample}_{read}_fastqc.html", sample=SAMPLES, read=READS)
log: expand("logs/fastqc/{sample}_{read}_raw.fastqc.err", sample=SAMPLES, read=READS)
shell: """
mkdir -p logs/fastqc
mkdir -p qc_raw
fastqc --outdir qc_raw --thread 8 --nogroup {input} \
>> data3/fastqc/raw/fastqc.log 2>>{log}
"""
rule trimming:
input:
fwd = "seqs_raw/{sample}_R1.fq.gz",
rev = "seqs_raw/{sample}_R2.fq.gz"
output:
fwd_paired = "seqs_trimmed/{sample}_trimmed_PE_R1.fq.gz",
fwd_unpaired = "seqs_trimmed/{sample}_trimmed_SE_R1.fq.gz",
rev_paired = "seqs_trimmed/{sample}_trimmed_PE_R2.fq.gz",
rev_unpaired = "seqs_trimmed/{sample}_trimmed_SE_R2.fq.gz"
params:
TRIMMOMATIC_JAR = "/path/to/download-software/trimmomatic-0.38-1/trimmomatic.jar",
PAIR = "PE SE".split(),
TRIM_OPTS = "-phred33 ILLUMINACLIP:" + ADAPTORS + ":2:30:10 LEADING:20 TRAILING:20 HEADCROP:18 SLIDINGWINDOW:4:20 MINLEN:36 "
log: "logs/trimmomatic/{sample}.PE.trimomatic.log"
shell: """
mkdir -p logs/trimmomatic
java -jar {params.TRIMMOMATIC_JAR} \
{params.PAIR[0]} {input.fwd} {input.rev} \
{output.fwd_paired} {output.fwd_unpaired} \
{output.rev_paired} {output.rev_unpaired} \
{params.TRIM_OPTS} 2>{log}
"""
rule fastqc_trimmed:
input: expand("seqs_trimmed/{sample}_trimmed_PE_{read}.fq.gz", sample=SAMPLES, read=READS)
output: expand("qc_trimmed/{sample}_trimmed_PE_{read}_fastqc.html", sample=SAMPLES, read=READS)
log: expand("logs/fastqc/{sample}_{read}_trimmed.fastqc.err", sample=SAMPLES, read=READS)
shell: """
mkdir -p qc_trimmed
fastqc --outdir qc_trimmed --thread 8 --nogroup {input} \
>> qc_trimmed/fastqc.log 2>>{log}
"""
rule alignment:
input: fwd="seqs_trimmed/{sample}_trimmed_PE_R1.fq.gz",
rev="seqs_trimmed/{sample}_trimmed_PE_R2.fq.gz"
output: "alignment/{sample}.bam"
params: threads=8,
indexbase=FASTA
log:
"logs/bwa/{sample}.log"
shell: """
mkdir -p alignment
mkdir -p logs/bwa
scripts/map_bwamem_v2.sh {input.fwd} {input.rev} {params.threads} {params.indexbase} | \
samtools view -b -hF 256 - > {output} 2>{log}
"""
rule SNP_indiv:
input: "snps_out/{sample}..bam"
output: "snps_out/{sample}.raw.snps.indels.g.vcf"
params: ref_fasta=FASTA,
gatk4_jar="/storage/ppl/yifang/download-software/anaconda3/envs/exome/share/gatk4-4.1.0.0-0/gatk-package-4.1.0.0-local.jar"
log: "logs/SNPcalling/{sample}.snp_indiv.log"
shell: """
mkdir -p logs/SNPcalling
java -DGATK_STACKTRACE_ON_USER_EXCEPTION=true \
-jar {params.gatk4_jar} HaplotypeCaller \
--reference {params.ref_fasta} --input {input} \
--emit-ref-confidence GVCF --output {output} 2>{log}
"""
rule combine_candidates:
input: "snps_out/GB4450.raw.snps.indels.g.vcf"
output: "snps_out/SNPs_candidates.g.vcf"
shell: """
grep "^#" snps_out/GB4450.raw.snps.indels.g.vcf > snps_out/tmp.g.vcf
grep -v "^#" <(cat snps_out/*.raw.snps.indels.g.vcf) | sort -V >> snps_out/tmp.g.vcf
awk '!/Scaffold/ && !A[$1$2$3]++' snps_out/tmp.g.vcf > {output}
"""
rule combine_gvcf:
input: "snps_out/{sample}.raw.snps.indels.g.vcf"
output: "snps_out/combined_gvcf"
params: gvcf_file_list="gvcf_files.list",
gatk4_jar="/path/to/download-software/gatk4-4.1.0.0-0/gatk-package-4.1.0.0-local.jar"
shell:"""
java -DGATK_STACKTRACE_ON_USER_EXCEPTION=true \
-jar {params.gatk4_jar} GenomicsDBImport -V {params.gvcf_file_list} \
--genomicsdb-workspace-path {output}
"""
How to ensure the execution of each rule is completed before the one downstream starts? For example, trimming must be finished before fastqc_trimmed starts; and rule alignment must be finished before SNP_indiv starts, so on. Especially, SNP_indiv must be finished for ALL the samples before combine_candidates because combine_candidates is the union of the results all samples from SNP_indiv.
I am aware Snakemake will handle the execution of each rules and individual samples, but when dependency is related to ALL the samples, completion of the upstream rules must be guaranteed.
In bash script, the command wait can be used, what is the trick for this scenario in Snakemake?
Thanks a lot!
Thanks! Could you please elaborate more of this line to ensure the completion of trimming before fastqc_trimmed starts?
Also, is the shell command line should be like this?
I could not find more information on this from the snakemake site. I seem to catch your idea, but not understand it in full. Thanks a lot again.