Hi,
I'm trying to subsample a large collection of processed fastq files in a Snakemake workflow. I want to subsample these files sequentially by one third, a total of six times. Because my fastq files are all different sizes, I expect that some of the subsamples will be empty before the sixth iteration completes. In that case the program (reformat.sh from the BBTools suite) exits successfully but does not write an output file, so I get a MissingOutputException from Snakemake and the workflow fails for those particular samples.
I tried using checkpoints to manage the dynamic number of output files, but what I would really like is a way to handle the MissingOutputException so that the checkpoint simply registers that there are fewer files than expected. Does anyone know how to do that, or can you suggest a better strategy for serial subsampling within Snakemake? Maybe I'm using checkpoints incorrectly? This is tangentially related to this Stack Overflow post, which gave a cleverer solution for reusing the same Snakemake rule (though I couldn't get it to work because of a CyclicGraphException; see the Snakemake docs).
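For context, one way to make a checkpoint "register fewer files than expected" is to have the aggregating input function list only the targets whose subset FASTQ actually exists, rather than expanding over all six indices up front. A minimal sketch of such a helper, assuming the file layout from my Snakefile below (the function name and `max_subset` default are my own; it would be called from an input function after the checkpoints have run):

```python
import os

def gather_existing_subsets(run, max_subset=6):
    """Return lca_gather.csv targets only for the subset FASTQ files
    that were actually produced, so missing subsets are simply skipped."""
    targets = []
    for i in range(1, max_subset + 1):
        if os.path.exists(f"{run}.subset{i}.fastq.gz"):
            targets.append(f"{run}.subset{i}.lca_gather.csv")
    return targets
```

Used as `input: lambda wc: gather_existing_subsets(wc.run)` in an aggregation rule, this keeps Snakemake from ever demanding a subset file that was never written.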
Snakemake error message:
MissingOutputException in line 187 of /home/nmb/skin_metag/process_reads/Snakefile:
Job completed successfully, but some output files are missing. Missing files after 1 seconds:
SRP188475/SRR8731712/SRR8731712.subset5.lca_gather.csv
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
File "/home/nmb/.local/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 544, in handle_job_success
File "/home/nmb/.local/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 225, in handle_job_success
Relevant code snippet is here:
#rules
rule all:
    input:
        expand("{run}.subset{subset}.lca_gather.csv", run = names_list, subset = ['1','2','3','4','5','6'])

def getFastq(run):
    return(names_dict[run])

def getOut(run):
    parts = run.split('/')
    out_prefix = '/'.join([parts[0], parts[1]])
    return(out_prefix)
rule fastqc:
    input:
        r1 = lambda wildcards: getFastq(wildcards.run)[0],
        r2 = lambda wildcards: getFastq(wildcards.run)[1]
    params:
        out_prefix = lambda wildcards: getOut(wildcards.run)
    output:
        "{run}_1_fastqc.html",
        "{run}_1_fastqc.zip",
        "{run}_2_fastqc.html",
        "{run}_2_fastqc.zip"
    threads: max_threads
    message: "STEP 1 - Running FastQC to check raw read quality."
    shell: "fastqc -o {params.out_prefix}/ -f fastq {input.r1} {input.r2}"
rule bbduk_trim:
    input:
        r1 = lambda wildcards: getFastq(wildcards.run)[0],
        r2 = lambda wildcards: getFastq(wildcards.run)[1]
    params:
        out_prefix = lambda wildcards: getOut(wildcards.run)
    output:
        "{run}.trimmed.fastq.gz",
        "{run}.bbduk.stats"
    threads: max_threads // 10
    message: "STEP 2 - Removing sequencing adapters, control DNA sequences, etc., and quality trimming reads."
    shell: "bbduk.sh -Xmx10g in={input.r1} in2={input.r2} out=stdout.fq ref=contamination.fa k=27 ktrim=r hdist=2 mink=16 threads={threads} | bbduk.sh -Xmx10g threads={threads} int=t in=stdin.fq out={output[0]} ref=contamination.fa k=23 ktrim=r hdist=0 ow mink=10 stats={output[1]}"
rule bbmap_clean:
    input:
        "{run}.trimmed.fastq.gz"
    output:
        "{run}.nonhuman.fastq.gz",
        "{run}.human.fastq.gz",
        "{run}.bbmap.stats"
    threads: max_threads // 10
    message: "STEP 3 - Removing reads that map to the human genome and saving them aside."
    shell: "bbmap.sh minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2 path=./ qtrim=rl trimq=10 untrim -Xmx24g in={input} outu={output[0]} outm={output[1]} threads={threads} statsfile={output[2]}"
rule bbnorm_lowpass:
    input:
        "{run}.nonhuman.fastq.gz"
    output:
        "{run}.highpass.fastq.gz",
        "{run}.lowpass.fastq.gz",
        "{run}.bbnorm.stats"
    threads: max_threads // 10
    message: "STEP 4 - Removing reads with k-mers that occur less than 3x."
    shell: "bbnorm.sh -Xmx20g in={input} out={output[0]} outt={output[1]} passes=1 target=999999999 min=3 threads={threads} histout={output[2]}"
rule sourmash_compute:
    input:
        "{run}.highpass.fastq.gz"
    output:
        "{run,^(?!subset).*}.sig"
    threads: 1
    message: "STEP 5 - Computing the minhash signature with sourmash."
    shell: "sourmash compute --track-abundance --scaled 1000 -k 21,31,51 {input} --output {output}"
rule sourmash_compare:
    input:
        signatures_list
    output:
        "compare_k31",
        "compare_k31.csv",
        "compare_k51",
        "compare_k51.csv"
    threads: max_threads
    message: "STEP 5 - Comparing all 31-mer (species level) and all 51-mer (species level) minhash signatures against each other, accounting for kmer abundance (abundance projection)."
    shell: "sourmash compare --processes {threads} -k 31 {input} --output {output[0]} --csv {output[1]} & sourmash compare --processes {threads} -k 51 {input} --output {output[2]} --csv {output[3]}; wait"
rule sourmash_gather:
    input:
        "{run}.sig",
        "genbank.lca"
    output:
        "{run,^(?!subset).*}.lca_gather.csv"
    threads: 1
    message: "STEP 6 - Gathering genomes contained in the metagenome with sourmash."
    shell: "sourmash gather -k 31 --output {output} {input[0]} {input[1]}"
checkpoint reformat_subset1:
    input:
        "{run}.highpass.fastq.gz"
    output:
        "{run}.subset1.fastq.gz"
    threads: max_threads // 10
    message: "STEP 7.1 - Subsetting processed reads by 0.3x."
    shell: "reformat.sh -Xmx20g in={input} out={output} samplerate=0.3 int=t"

checkpoint reformat_subset2:
    input:
        "{run}.subset1.fastq.gz"
    output:
        "{run}.subset2.fastq.gz"
    threads: max_threads // 10
    message: "STEP 7.2 - Subsetting subset #1 reads by 0.3x."
    shell: "reformat.sh -Xmx20g in={input} out={output} samplerate=0.3 int=t"

checkpoint reformat_subset3:
    input:
        "{run}.subset2.fastq.gz"
    output:
        "{run}.subset3.fastq.gz"
    threads: max_threads // 10
    message: "STEP 7.3 - Subsetting subset #2 reads by 0.3x."
    shell: "reformat.sh -Xmx20g in={input} out={output} samplerate=0.3 int=t"

checkpoint reformat_subset4:
    input:
        "{run}.subset3.fastq.gz"
    output:
        "{run}.subset4.fastq.gz"
    threads: max_threads // 10
    message: "STEP 7.4 - Subsetting subset #3 reads by 0.3x."
    shell: "reformat.sh -Xmx20g in={input} out={output} samplerate=0.3 int=t"

checkpoint reformat_subset5:
    input:
        "{run}.subset4.fastq.gz"
    output:
        "{run}.subset5.fastq.gz"
    threads: max_threads // 10
    message: "STEP 7.5 - Subsetting subset #4 reads by 0.3x."
    shell: "reformat.sh -Xmx20g in={input} out={output} samplerate=0.3 int=t"

checkpoint reformat_subset6:
    input:
        "{run}.subset5.fastq.gz"
    output:
        "{run}.subset6.fastq.gz"
    threads: max_threads // 10
    message: "STEP 7.6 - Subsetting subset #5 reads by 0.3x."
    shell: "reformat.sh -Xmx20g in={input} out={output} samplerate=0.3 int=t"
checkpoint sourmash_subset_compute:
    input:
        "{run}.subset{subset}.fastq.gz"
    output:
        "{run}.subset{subset}.sig"
    threads: 1
    message: "STEP 8 - Computing the minhash signature with sourmash."
    shell: "sourmash compute --track-abundance --scaled 1000 -k 21,31,51 {input} --output {output}"

checkpoint sourmash_subset_gather:
    input:
        "{run}.subset{subset}.sig",
        "genbank.lca"
    output:
        "{run}.subset{subset}.lca_gather.csv"
    threads: 1
    message: "STEP 9 - Gathering genomes contained in the subset metagenome with sourmash."
    shell: "sourmash gather -k 31 --output {output} {input[0]} {input[1]}"
How do you have the output files defined in rule all?

I'll just edit the post to show all of the snakemake rules above.
I think the issue here is that you're telling snakemake to expect subset iteration 1 through 6 for all of your .fastq files. Do you need the intermediate subset .fastq files for what you're trying to achieve?
Assuming you do need all of your intermediate files, I found this answer on Stack Overflow very helpful when I was doing something similar.
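Along those lines, the six near-identical checkpoints could be collapsed into one rule parameterized by the `subset` wildcard, with an input function that chains each subset to the previous one. The CyclicGraphException usually goes away once the `subset` wildcard is constrained to digits, so the outputs can't match their own inputs. A sketch, reusing the file names from the Snakefile above (untested against the actual data, so treat it as a starting point):

```python
# Input function for a single parameterized subsetting checkpoint:
# subset 1 reads the highpass file, subset N reads subset N-1.
# Pair it with, e.g.:
#   wildcard_constraints:
#       subset = r"\d+"
# so the DAG stays acyclic, and:
#   checkpoint reformat_subset:
#       input: subset_input
#       output: "{run}.subset{subset}.fastq.gz"
#       shell: "reformat.sh -Xmx20g in={input} out={output} samplerate=0.3 int=t"
def subset_input(wildcards):
    i = int(wildcards.subset)
    if i == 1:
        return f"{wildcards.run}.highpass.fastq.gz"
    return f"{wildcards.run}.subset{i - 1}.fastq.gz"
```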