Hi, I'm trying to use Snakemake to build a pipeline for my analyses.
I've just started, so I want to make a simple workflow (BLAST amino acid query sequences against my database), but I have no idea why it keeps raising an error. My code is below:
# Configuration file
configfile: "BLAST-config.yaml"

# In configuration file,
QUERY_PATH_faa: /home/user/study/faa_input/
OUTPUT_PATH_master: /home/user/study/blast_output
DataBaseProt: /home/user/db/protein_ref
BLASTParams: "-evalue 0.01 -perc_identity 70 -word_size 10 sorthits 3"

# FILENAMES_faa contains the list of file name without extensions
FILENAMES_faa = glob_wildcards(config["QUERY_PATH_faa"]+"{fname}.faa").fname

rule all:
    input:
        expand(config["OUTPUT_PATH_master"]+"/BLAST/{filename}.RAW", filename=FILENAMES_faa)

rule RAWBLASTVF:
    input:
        expand(config["QUERY_PATH_faa"]+"{filename}.faa", filename=FILENAMES_faa)
    output:
        expand(config["OUTPUT_PATH_master"]+"BLAST/{filename}.RAW", filename=FILENAMES_faa)
    threads:
        30
    shell:
        """
        blastp {config[BLASTParams]} \
        -db "{config[DataBaseProt}" \
        -query {input} \
        -out {output} \
        -outfmt 7 \
        -num_threads {threads}
        """

And the error I got is:
MissingInputException in line ~~ of Snakefile:
Missing input files for rule all:
Path/to/Sample_A.RAW
Path/to/Sample_B.RAW
Path/to/Sample_C.RAW
Does anyone have an idea what's wrong with the code? Thank you in advance.
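
One thing worth checking in the code above, offered as a sketch rather than a confirmed diagnosis: rule all builds its targets with "/BLAST/" while RAWBLASTVF builds its outputs with "BLAST/", and OUTPUT_PATH_master has no trailing slash. Plain Python string concatenation shows that the two strings differ. The dictionary below is a stand-in for the loaded config, and os.path.join is just one possible way to keep both sides consistent, not something taken from the original code.

```python
import os.path

# Stand-in for the loaded config (value copied from the question).
config = {"OUTPUT_PATH_master": "/home/user/study/blast_output"}
filename = "Sample_A"

# Path requested by rule all (note the leading slash before BLAST).
requested = config["OUTPUT_PATH_master"] + "/BLAST/" + filename + ".RAW"

# Path declared as output by RAWBLASTVF (no slash before BLAST).
produced = config["OUTPUT_PATH_master"] + "BLAST/" + filename + ".RAW"

# One possible way to build the same path in both rules (an assumption,
# not part of the original Snakefile).
consistent = os.path.join(config["OUTPUT_PATH_master"], "BLAST", filename + ".RAW")

print(requested)
print(produced)
print("same path:", requested == produced)
```

If the requested targets and the declared outputs are not character-for-character identical, no rule can produce the files that rule all asks for, which would be consistent with the MissingInputException above.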
Hi, thank you for your comment! I've resolved that problem, but unfortunately I've run into another error. What I want to do is run BLAST on each file named in the list "FILENAMES_faa", but with the command above, Snakemake passes all the files in the list on a single line, so BLAST fails (that is, I want to BLAST A.faa, B.faa, and C.faa one by one, but Snakemake passes them as "A.faa, B.faa, C.faa", which is 3 files for one BLAST run).
Do you have any idea how to resolve this? I tried using a for loop in Python, but that also raises an error.
Sorry I just saw this comment. You should change your RAWBLASTVF rule to just take a single file as input and make a single file as output (with the {filename} wildcard left unspecified in both). Then your existing "all" rule will still ask for all the various blast output files it wants as input, and instead of having Snakemake think it should run RAWBLASTVF just once to make everything, it'll notice that it needs to run that rule many times separately.
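
To illustrate the idea, here is a rough Python sketch of the matching. It is not Snakemake's actual implementation, and the directory constants and the match_wildcard and plan_jobs helpers are hypothetical names made up for this example. Each file requested by "all" is matched against the rule's output pattern, the {filename} value is read off, and the single matching input is derived from it, giving one job per file.

```python
import re

# Hypothetical paths modelled on the question (output path assumed already fixed).
QUERY_DIR = "/home/user/study/faa_input/"
OUTPUT_DIR = "/home/user/study/blast_output/"

INPUT_PATTERN = QUERY_DIR + "{filename}.faa"
OUTPUT_PATTERN = OUTPUT_DIR + "BLAST/{filename}.RAW"


def match_wildcard(pattern, target):
    """Return the {filename} value if target fits pattern, else None."""
    prefix, suffix = pattern.split("{filename}")
    regex = re.escape(prefix) + r"(?P<filename>.+)" + re.escape(suffix) + r"$"
    found = re.match(regex, target)
    return found.group("filename") if found else None


def plan_jobs(targets):
    """Pair every requested output with the one input it is derived from."""
    jobs = []
    for target in targets:
        name = match_wildcard(OUTPUT_PATTERN, target)
        if name is None:
            raise ValueError(f"no rule produces {target}")
        jobs.append((INPUT_PATTERN.format(filename=name), target))
    return jobs


# What rule all would request for three query files.
targets = [OUTPUT_PATTERN.format(filename=n) for n in ["A", "B", "C"]]
jobs = plan_jobs(targets)
for query, out in jobs:
    print(f"blastp -query {query} -out {out}")
```

Each (input, output) pair corresponds to one separate run of the rule, which is why no explicit loop is needed in the Snakefile.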
In other words, the first few lines of that rule definition can actually just look like:
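
A minimal sketch of those lines, reusing the config keys and path strings exactly as posted in the question:

```snakemake
# Sketch only: paths copied as posted; no expand(), so {filename} stays a wildcard.
rule RAWBLASTVF:
    input:
        config["QUERY_PATH_faa"]+"{filename}.faa"
    output:
        config["OUTPUT_PATH_master"]+"BLAST/{filename}.RAW"
    # threads: and shell: stay as in the original rule
```
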
(But including whatever fixes for file paths you've already made!) That way, Snakemake can do its thing and fill in the right filenames for each separate blastp call; no other code (like loops) needed. Makes sense?

You might also find -p (--printshellcmds) and -r (--reason) helpful when troubleshooting, especially with -n (--dryrun). The first will show any commands that will be run so you can make sure they look like what you expect, the second will show why a rule is being run (like, missing output file, updated input file, etc.) and the third won't actually run anything yet. There's also --debug-dag, which can help with more complicated workflows by explaining what rules it's selecting to supply what output files.

Sorry for the late reply, and thank you for your super detailed comment! It helped me fix the error and make the pipeline cleaner!