I've ran into a problem trying to update my snakemake workflow to accept input from multiple directories.
Originally I had used glob_wildcards()
_within_ my Snakefile
like this:
# Full path to raw files
SAMPLES, = glob_wildcards("{sample}_R1.fastq.gz")
rule all:
'''
The target rule for the workflow.
'''
input:
expand('fastQC/before_trim/{sample}_R1_fastqc.html', sample=SAMPLES),
expand('fastQC/before_trim/{sample}_R1_fastqc.zip', sample=SAMPLES),
This is not ideal for the obvious reasons that it is 1) Contained within the Snakefile
and 2) will only work on files from one directory (in this case the $PWD)
Rather than a user messing around with an optimized Snakefile I would like them to use a samples.txt
file that contains information for the file name and file location.
For example a samples.txt
file that looks like this:
sample location
BC1217 /foo/bar/pow/fastq
470 /wam/bam/crash/data
MTG109 /penguin/fire/station/
In addition, to the samples.txt
there is also a config.yaml
which contains information on the reference files/locations and will _eventually_ contain choices for different rules (e.g. variant calling with Freebayes
or DeepVariant
) controlled with IF/ELSE statements - but that's a whole'nother post. It looks like this:
# Files
REF_GENOME: "WS265_wormbase/c_elegans.PRJNA13758.WS265.genomic.fa.gz"
GENOME_ANNOTATION: "WS265_wormbase/c_elegans.PRJNA13758.WS265.annotations.gff3"
# Tools
QC_TOOL: "fastQC"
TRIM_TOOL: "trimmomatic"
ALIGN_TOOL: "bwa"
MARKDUP_TOOL: "picard"
CALLING_TOOL: "varscan"
ANNOT_TOOL: "coovar"
I actually based this workflow on an example I found here but it doesn't work for me.
I'll post the most pertinent parts of the code here but the full Snakefile can be found at the gist
# Load Samples ----------------------------------------------------------------
import pandas as pd
# getting the samples information (names, path to r1 & r2) from samples.txt
samples_information = pd.read_csv("samples.txt", sep='\t', index_col=False)
# get a list of the sample names
sample_names = list(samples_information['sample'])
sample_locations = list(samples_information['location'])
rule qc_before_trim:
input:
r1 = expand('{FASTQ_DIR}/{sample}_R1.fastq.gz', FASTQ_DIR=sample_locations, sample=sample_names),
r2 = expand('{FASTQ_DIR}/{sample}_R2.fastq.gz', FASTQ_DIR=sample_locations, sample=sample_names)
output:
expand('{QC_DIR}/{QC_TOOL}/before_trim/{sample}_R1_fastqc.html', QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"], sample=sample_names),
expand('{QC_DIR}/{QC_TOOL}/before_trim/{sample}_R1_fastqc.zip', QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"], sample=sample_names),
expand('{QC_DIR}/{QC_TOOL}/before_trim/{sample}_R2_fastqc.html', QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"], sample=sample_names),
expand('{QC_DIR}/{QC_TOOL}/before_trim/{sample}_R2_fastqc.zip', QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"], sample=sample_names)
log: expand('{LOG_DIR}/{sample}-{QC_TOOL}-qc-before-trim.log', QC_TOOL=config["QC_TOOL"], LOG_DIR=dirs_dict["LOG_DIR"], sample=sample_names)
resources:
mem = 1000,
time = 30
threads: 1
params:
dir = expand('{QC_DIR}/{QC_TOOL}/before_trim/', QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"])
message: """--- Quality check of raw data with FastQC before trimming."""
shell:
'module load fastqc/0.11.5;'
'fastqc -o {params.dir} -f fastq {input.r1} &'
'fastqc -o {params.dir} -f fastq {input.r2}'
I guess the problem is that expand()
generates all of the combinations versus what I want is iteration over two lists. Jeremy Leipzig was kind enough to suggest the use of wildcards on Twitter.
Prior to trying this I had tried something I found from another source but it didn't work. It was a method that defined wildcards as input for each rule. Unfortunately, I cannot find the source at the moment so I'll try to take a better look in my bookmarks in a few minutes. However, I did find some code-snips to show that it looked something like this:
def get_raw_fastq(wildcards):
''' return a dict with the path to the raw fastq files'''
r1 = samples_information[samples_information["sample"] == wildcards.sample]['r1']
r2 = samples_information[samples_information["sample"] == wildcards.sample]['r2']
return {'r1': r1, 'r2': r2}
def get_fastq(wildcards):
''' return the pair of compressed fastq files for a sample.
This helper function returns the appropriate paths according to the
config file.
'''
d = {}
d['r1'] = list(samples_information[samples_information["sample"] == wildcards.sample]['r1'])
d['r2'] = list(samples_information[samples_information["sample"] == wildcards.sample]['r2'])
return d
I had a bit of trouble trying to grok how to change this code to make it work for my situation probably from my lack of familiarity with Python. I would appreciate if someone could help me solve this, thank you.
zip
) was just changing the input of the rule to your suggestion.Switch:
into
However, it's looking for the input in QC directory (which is the output of this rule and not the input)
If I don't need to use
expand()
in target rules I won't - I think your suggestion improves code readability - however, it doesn't seem to be working...When you mention intermediate targets I assume you're referring to the
coovar
rule?These are a temporary files created by
coovar
- I do not necessarily require them downstream.Given what you've said in #1 & #3 (if this does work somehow), and what's in Step 6. of the Snakemake Advanced manual could I do the following to have them deleted?
zip
:I can figure out how to include them as input using
expand()
then but I'm not sure how to do otherwise:So to summarize, your suggestions for eliminating redundancy in the
expand()
statements and usingzip()
are wonderful and seem to do the trick - thank you!Making the code more readable by entirely doing away with
expand()
would be excellent but I seem to be having some issues with this still.ok yes you can prepend your input with directories i.e
r1 = os.path.join(FASTQ_DIR,"{sample}_R1.fastq.gz")
. that's ok i would just avoidexpand
in wildcards rules.now for the zip thing you are close. you can just use
samples_set
to create a lookup functiondef getHome(sample): return(os.path.join(dict(samples_set)[sample],sample)))
and then set your directory in the input tomydir = lambda wildcards: getHome(wildcards.sample)
.Sorry but I'm still confused as to what I need to do.
Your suggestion is to use
os.path.join()
instead ofexpand()
in wildcards rules, correct?FASTQ_DIR
hasn't be defined anywhere so I cannot just use that - I assumed you meant you need to setmydir
instead?Your
def getHome(sample)
function may have contained an extra)
at the end because I got an error. I've tried the following but get an error:ok there's many ways to skin a cat but this is one strategy:
def getHome(sample): return(os.path.join(dict(samples_set)[sample],"{0}_{1}.fastq.gz".format(sample,pair)) for pair in ['R1','R2'])
r1 = lambda wildcards: getHome(wildcards.sample)[0],
r2 = lambda wildcards: getHome(wildcards.sample)[1]
I'm getting the following error:
don't use expand in wildcard rules
Do you mean in the
output
of the rule?If so does this apply to
log:
andparams:
as well?I have no idea how not use
expand()
for this statement asQC_TOOL
is defined in the config fileQC_TOOL=config["QC_TOOL"]
I've even tried to simplify the
rule all
and theoutput
to not use expand and as you can see I'm struggling:Okay so switching to this now produces this error:
uhh test that getHome lookup function in isolation. dict(sample_set) should be a dictionary with the samples as keys and paths as values. the goal is to have a function that takes a sample and returns a list of 2 paths, one for each pair
ummm okay, so I'm not really sure how to troubleshoot/test this?
Not able to run the function
getHome()
as-is so I tried breaking it down into sections. Callingdict(samples_set)
gives this error so perhapssamples_set
is wrong?I read somewhere that the issue could be because of feeding in a list instead of a tuple to
dict()
:Hmmm....okay....
okay now that I verified this works let's try on
samples_set
:Omg I have no idea what's going here....
Appreciate your help figuring this out
I believe one of the errors is that
samples_set
was not defined properly. Maybe it should look like this instead?Now I can try:
So that seems to work (although I'm not sure if the order of sample/location is correct?
Not sure how to test the
getHome()
function further...ok i guess you have to explicitly make sure python runs the generator (notice the
list
cast)Okay so you used
sample_dict
here which I suppose is the same asdict(samples_set)
. I noticed I had the order for thekey
wrong earlier insamples_set
so I've fixed that now.The following change to the snakemake file:
But I'm getting the error:
ok what does
dict(samples_set)
look like? is BC1217 a key?I think this has something to do with you mixing integers and strings in your keys (sample names). Make sure you explicitly state that
470
is a string when you build the dictionary and the Snakemake target.I've spent some time searching for this but I have no idea how to do such a thing in Python (I use R).
There was this example of how to do the opposite of what you suggested.:
So I thought to myself, okay since:
I should just convert
sample_names
to"
instead of'
but this SO post say's you cannot so I'm confused...Could you please explain how I can do this?
Actually before we get too sidetracked...
I've removed the
470
sample from thesamples.txt
just to simplify things - the error persists. Besides, the error was, and still is, pointing toBC1217
.ok what i think is going on is the generator object returned by zip is getting spent without being reset, so it is empty the next time it's evaluated. just create a
dict
right away and use thatsamples_dict = dict(zip(sample_names, sample_locations))
Thanks Jeremy, I think there is some progress snow:
It looks like the
getHome()
function is now behaving properly - it returns the path to both R1/R2.Unfortunately still getting an error when trying to run snakemake.
And the error:
i don't see you using a
samples_dict
A rose by any other name
samples_set = dict(zip(sample_names, sample_locations))
So does the function then need to be changed?
If so, how?
Changing to the following errors:
samples_dict[sample]
Yes I had also tried that.
i'm running out of horizontal space, let's move this discussion to https://github.com/leipzig/biostars439754
Worked for me.
I've made a PR regarding how to include the directories that come before
"{sample}_R1_fastqc.html"
:How do I tell
"{QC_DIR}/{QC_TOOL}/{sample}_R1_fastqc.html"
comes fromQC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"])
?For example, if there was no dynamically generated
QC_TOOL
fromconfig=["QC_TOOL"]
one could simple do:os.path.join(dirs_dict["QC_DIR"] + '/' + '/' + '{sample}_R1_fastqc.html')
However, I want to include information from the
config.yaml
but it doesn't accept fromos.path.join(dirs_dict["QC_DIR"] + '/' + config["QC_TOOL"] + '/' + '{sample}_R1_fastqc.html')
:This is also redundant for both
{pair}
and{ext}
. So what's the proper, succinct, way to do this?I pushed another commit that uses what you propose
published at as a CodeOcean capsule: https://codeocean.com/capsule/4796507/tree/v1