Not knowing the exact inputs or outputs when you write the workflow is relatively common in bioinformatics, and I think Snakemake handles it rather well.
Let's assume taxonomy.txt is a tab-separated file formatted as below.
[~/Data/scratch/tmp/biostar/checkpoint]$ cat taxonomy.txt
bin unique multi tax
90-20-09-2018.001 25 15 Lactobacillus
90-20-09-2018.003 24 0 Streptococcus
90-20-09-2018.002 15 0 Lactobacillus_2
There are many ways to handle this; below is an example that uses a checkpoint
(https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#data-dependent-conditional-execution).
[~/Data/scratch/tmp/biostar/checkpoint]$ cat example.py
import os, csv

def aggregate(wildcards):
    """aggregate paths from `read_input_txt` and return as list"""
    with open(checkpoints.read_input_txt.get(**wildcards).output[0], 'r') as fin:
        return [loc.rstrip() for loc in fin]

rule:
    input: aggregate

checkpoint read_input_txt:
    """implement logic to turn `taxonomy.txt` into a list of desired files to create"""
    input: 'taxonomy.txt'
    output: 'files_to_be_created.txt'
    run:
        with open(input[0], 'r') as fin, open(output[0], 'w') as fout:
            reader = csv.DictReader(fin, delimiter='\t')
            writer = csv.writer(fout)
            for data in reader:
                writer.writerow([os.path.join('binned', data['bin'].split('.')[0], data['tax'])])

rule create_file:
    """implement logic to create the actual file"""
    output: touch('{prefix}/{file}')
After running snakemake -s example.py:
[~/Data/scratch/tmp/biostar/checkpoint]$ tree binned/
binned/
└── 90-20-09-2018
    ├── Lactobacillus
    ├── Lactobacillus_2
    └── Streptococcus

1 directory, 3 files
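If you want to check the path logic without running Snakemake, the transformation done inside read_input_txt can be tried as plain Python. This is only a sketch: the function name target_paths and the in-memory example table are made up for illustration, and it assumes taxonomy.txt is tab-separated with the bin and tax columns shown above.

```python
import csv
import io
import os


def target_paths(taxonomy_file, root="binned"):
    """Return one output path per row of a tab-separated taxonomy table."""
    reader = csv.DictReader(taxonomy_file, delimiter="\t")
    paths = []
    for row in reader:
        # '90-20-09-2018.001' -> '90-20-09-2018' (drop the bin suffix)
        sample = row["bin"].split(".")[0]
        paths.append(os.path.join(root, sample, row["tax"]))
    return paths


# Hypothetical in-memory copy of the taxonomy.txt shown above.
example = io.StringIO(
    "bin\tunique\tmulti\ttax\n"
    "90-20-09-2018.001\t25\t15\tLactobacillus\n"
    "90-20-09-2018.003\t24\t0\tStreptococcus\n"
    "90-20-09-2018.002\t15\t0\tLactobacillus_2\n"
)
paths = target_paths(example)
for path in paths:
    print(path)
```

Each printed path corresponds to one line of files_to_be_created.txt, which is what aggregate hands back to Snakemake as the list of inputs to build.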
I hope you find this helpful and can expand the example into a working solution.
I would say just give the rule its input and output files; why not, if you have them? Otherwise snakemake never knows if something went wrong. If you really want to do this, a temp file may be a solution: https://stackoverflow.com/questions/45624969/is-there-a-way-to-chain-snakemake-rules-without-touch-files
Thanks for your answer. The problem with giving input and output is that I don't know the names of the output files. The renaming is based on some other files, which vary according to my input file, so I can't tell snakemake the names of the outputs.
Maybe this can also help; I don't know the details myself: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#dynamic-files I understand these things can be difficult in snakemake.
EDIT: You could also build some checks into the Python script, so that if there are no errors the script creates a "successful_complete" file or something similar.
EDIT2: Another option is to create a wildcard before the rule.
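A rough sketch of the completion-file idea from the first EDIT, in case it helps. Everything here is hypothetical: the function rename_with_marker, the marker name rename.done and the example file names are invented for illustration, and the real checks would depend on your script. The marker is only written when every rename has succeeded, so a rule could declare the marker as its output instead of the unknown file names.

```python
import os
import tempfile


def rename_with_marker(workdir, renames, marker="rename.done"):
    """Rename files in workdir; write a marker file only if all renames succeed."""
    # Check up front that every source file exists, so nothing is half-renamed.
    missing = [old for old in renames
               if not os.path.exists(os.path.join(workdir, old))]
    if missing:
        raise FileNotFoundError(f"missing input files: {missing}")
    for old, new in renames.items():
        os.rename(os.path.join(workdir, old), os.path.join(workdir, new))
    # Reaching this point means no error was raised: record the new names.
    marker_path = os.path.join(workdir, marker)
    with open(marker_path, "w") as fout:
        fout.write("\n".join(sorted(renames.values())) + "\n")
    return marker_path


# Hypothetical usage in a scratch directory.
workdir = tempfile.mkdtemp()
open(os.path.join(workdir, "bin.001.fa"), "w").close()
marker_path = rename_with_marker(workdir, {"bin.001.fa": "Lactobacillus.fa"})
```

If any input is missing the function raises before touching anything, the marker is never created, and snakemake would see the rule as failed.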
Can you provide some examples of these input/output files?
Yes! I'd like to refer to this thread I made earlier: https://stackoverflow.com/questions/58623831/renaming-files-in-a-folder-according-to-string-in-another-file-with-python