Hello all, I am very lost with snakemake and maybe someone here could help. Below a working example where I am at at this point but it need to be in a different way. In checkpoint split_to_fasta
a script will create an unknown number of fasta files that need to be used as seperate blast databases. The fasta files could be pairs but this is not always the case. After blasting I want to give the pairs as input to a new rule (step 4). I want to process all pairs in parallel and thus "step 4" can only start when all blasting is finished. In rule all
I hardcoded the "pair" id's for now but this need to be created based on the outputfiles of the previous rules. I may(?) need a second checkpoint but dont know how to implement that.
I could make a second pipeline that "starts" with step 4 and execute them after each other but that sounds like bad practice.
from glob import glob
rule all:
input:
#I added this line the make a working example
expand("output/step4/{number}_step4.txt", number=["1","2","3","4"])
checkpoint split_to_fasta:
input:
seqfile = "files/seqs.csv"#just an empty file in this example
output:
fastas=directory("output/fasta")
shell:
#A python script will create below files and I dont know them beforehand.
"mkdir -p {output.fastas} ; "
"echo test > {output.fastas}/1_LEFT_sample_name_foo.fa ;"
"echo test > {output.fastas}/1_RIGHT_sample_name_foo.fa ;"
"echo test > {output.fastas}/2_LEFT_sample_name_spam.fa ;"
"echo test > {output.fastas}/2_RIGHT_sample_name_bla.fa ;"
"echo test > {output.fastas}/3_LEFT_sample_name_egg.fa ;"
"echo test > {output.fastas}/4_RIGHT_sample_name_ham.fa ;"
rule makeblastdb:
input:
fasta = "output/fasta/{fasta}.fa"
output:
blastdb = "output/makeblastdb/{fasta}_blastdb.txt",
shell:
"cp {input.fasta} {output.blastdb}"
rule blast:
input:
file = rules.makeblastdb.output.blastdb
output:
blast = "output/blast/{fasta}_blast.txt",
shell:
"cp {input.file} {output.blast}"
def aggregate_input(wildcards):
checkpoint_output = checkpoints.split_to_fasta.get(**wildcards).output[0]
###dont know where to use this line correctly
###files = [Path(x).stem.split("_")[0] for x in glob("output/step3/"+ f"*_step3.txt") ]
return expand("output/blast/{fasta}_blast.txt", fasta=glob_wildcards(os.path.join(checkpoint_output, "{fasta}.fa")).fasta)
def get_id_files(wildcards):
blast = glob("output/blast/"+ f"{wildcards.number}*_blast.txt")
return sorted(blast)
rule step4:
input:
step3files = aggregate_input,
idfiles = get_id_files
output:
step4 = "output/step4/{number}_step4.txt",
run:
shell("cat {input.idfiles} > {output.step4}")
#input would be something like below (both sets need to run in parallel):
#[2_LEFT_sample_name_spam.fa, 2_RIGHT_sample_name_bla.fa]
#[3_LEFT_sample_name_egg.fa]
#With some python code I can determine the left and right file, if for example RIGHT is missing I just give an empty string to the script.
#shell(python script.py -left leftfile -right rightfile -o {output.step4})
For whoever runs into this same isseu I got a solution here: https://stackoverflow.com/questions/70543163/snakemake-checkpoint-and-create-a-expand-list-wildcards-based-on-created-output
Can you use a function to autodetect the resulting files for your expansion statement? Since snakeamake is python, you can define functions in your Snakefile, and your comment: "#With some python code I can determine the left and right file..." indicates there's some logic or priors you're aware of.
Thanks for the reply, I do understand your point and you are right. I partly do it this way to learn snakemake better but realise now it is confusing for others. I got a solution on stackoverflow.
I have a hard time understanding your DAG logic. Assuming contents in
seqs.csv
are not known until run time, you can try the following and go from there.Thank you for the effort, I will look at the code. It is a really different approach and helps me learning the possibilities of snakemake. I dont think it really solves my isseu but I also do realise my question is not so clear. My question is solved here: https://stackoverflow.com/questions/70543163/snakemake-checkpoint-and-create-a-expand-list-wildcards-based-on-created-output the second is example in the initial question is the working solution.