In my current project, I have ~560 RNA-seq fastq files (120 samples) that were sequenced in various locations. As such, the naming structure of the files differs, some files are zipped and some are unzipped, the files are dotted around our databank in various locations, and the sample ID for each file is encoded either at the start of the file name or in the folder that the file is located in. Normally, I would standardise the naming structure of all of these files using multiple awk and sed commands, but I'm finally trying to integrate this file-preparation part of the analysis into snakemake. I've been using snakemake for a while now, but my python is not great yet.
To move and rename the files, I created a .tsv file called all_samples_nameKey (the original names/locations are abbreviated here for sanity):
fastq sampleID
dirA/1117_S7_L005_R1_001.fastq.gz 1117_Crdf_L5_R1.fastq.gz
dirA/1117_S7_L005_R2_001.fastq.gz 1117_Crdf_L5_R2.fastq.gz
dirA/1117_S7_L006_R1_001.fastq.gz 1117_Crdf_L6_R1.fastq.gz
dirA/1117_S7_L006_R2_001.fastq.gz 1117_Crdf_L6_R2.fastq.gz
dirB/18983/D00125_0240_8_1.sanfastq.gz 18983_Edin_L8_R1.fastq.gz
dirB/18983/D00125_0240_8_2.sanfastq.gz 18983_Edin_L8_R2.fastq.gz
dirC/17109_CAGATC_L004_R1_001.fastq 17109_Extr_L4_R1.fastq
dirC/17109_CAGATC_L004_R2_001.fastq 17109_Extr_L4_R2.fastq
with the following Snakefile:
import json
import os
import pandas as pd

configfile: "config.yaml"

# ---------- SET VARIABLES ----------
SCRATCH = config["SCRATCH"]
sample_file = config["sample_list"]
sample_df = pd.read_table(sample_file, sep="\t+", header=0)
sampleID = sample_df.sampleID
fastq = sample_df.fastq

MERGE_FILES = json.load(open(config["MERGE_FILES_JSON"]))
ALL_MERGED_SAMPLES = sorted(MERGE_FILES.keys())

# ------------- RULES --------------
localrules: move_and_rename_fastqs

rule all:
    input:
        expand(SCRATCH + "/01RAW_fqs/{sample}", sample=sample_df.sampleID),
        expand(SCRATCH + "/02ZIPPED_fqs/{sample}", sample=sample_df.sampleID),
        expand(SCRATCH + "/03MRGD_fqs/{sampleID}_R2.fastq.gz", sampleID=ALL_MERGED_SAMPLES)

rule move_and_rename_fastqs:
    input:
        fastq = lambda w: sample_df[sample_df.sampleID == w.sample].fastq.tolist()
    output:
        temp(SCRATCH + "/01RAW_fqs/{sample}")
    log:
        SCRATCH + "/00logs/01RAW_fqs/{sample}.log"
    shell:
        "cp {input.fastq} {output} 2> {log}"

rule zip_fastqs:
    input:
        SCRATCH + "/01RAW_fqs/{sample}"
    output:
        temp(SCRATCH + "/02ZIPPED_fqs/{sample}")
    params:
        outdir = SCRATCH + "/02ZIPPED_fqs/"
    run:
        if wildcards.sample.endswith('.fastq'):
            shell("gzip {input}")
            shell("mv {input}.gz {params.outdir}")
        else:
            shell("mv {input} {params.outdir}")

rule merge_fastqs:
    input:
        r1 = lambda wildcards: MERGE_FILES[wildcards.sampleID]['R1'],
        r2 = lambda wildcards: MERGE_FILES[wildcards.sampleID]['R2']
    output:
        r1 = SCRATCH + "/03MRGD_fqs/{sampleID}_R1.fastq.gz",
        r2 = SCRATCH + "/03MRGD_fqs/{sampleID}_R2.fastq.gz"
    log:
        r1 = SCRATCH + "/00logs/03MRGD_fqs/{sampleID}_R1.log",
        r2 = SCRATCH + "/00logs/03MRGD_fqs/{sampleID}_R2.log"
    params:
        indir = SCRATCH + "/02ZIPPED_fqs/",
        outdir = SCRATCH + "/03MRGD_fqs/"
    run:
        if len(input.r1) > 1:
            print(wildcards.sampleID, ": has > 1 fastq file per read.\n", input.r1)
            shell("cat {params.indir}{wildcards.sampleID}*_R1.fastq.gz > {output.r1} 2> {log.r1}")
            shell("cat {params.indir}{wildcards.sampleID}*_R2.fastq.gz > {output.r2} 2> {log.r2}")
        else:
            print(wildcards.sampleID, ": has only 1 fastq file per read.\n", input.r1)
            shell("mv {params.indir}{wildcards.sampleID}*_R1.fastq.gz {params.outdir}{wildcards.sampleID}_R1.fastq.gz 2> {log.r1}")
            shell("mv {params.indir}{wildcards.sampleID}*_R2.fastq.gz {params.outdir}{wildcards.sampleID}_R2.fastq.gz 2> {log.r2}")
The first rule, when run on its own, works nicely and transfers the files over as expected. snakemake -np also schedules the proper number of jobs for all three rules. However, when I add in the zip_fastqs rule and the snakemake -np output is properly scrutinised, the {sample} wildcard in the first rule is backfilled from {sample} in the second rule, meaning the sampleIDs in the inputs and outputs of the move_and_rename_fastqs rule no longer match.
If I run the first and second rules independently, one after the other, both create the files needed. However, I get an error for every instance of the zip_fastqs rule where the {input} has the .fastq extension: since these files are gzipped within the rule, the final file carries a .gz extension, which doesn't match what is specified in the {output} for that rule.
Also, with the merge_fastqs rule, as the number of files needing to be merged can range between 0 and 6 per read, I thought it best to use a json file to plug the required number of files into the input rather than expand. But the lack of consistency in the {wildcards} across the 3 rules means I need to store multiple copies of the fastqs, which is pointless. I tried using temp() to deal with this, but the {wildcards} consistency issue means I need to keep the expand functions in rule all in order that each rule can produce the files I need.
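For reference, given how MERGE_FILES[wildcards.sampleID]['R1'] is indexed above, the json maps each merged sample ID to per-read lists of files, something like this (abbreviated and illustrative, in the same spirit as the table above; paths shortened):

{
    "1117_Crdf": {
        "R1": ["1117_Crdf_L5_R1.fastq.gz", "1117_Crdf_L6_R1.fastq.gz"],
        "R2": ["1117_Crdf_L5_R2.fastq.gz", "1117_Crdf_L6_R2.fastq.gz"]
    },
    "18983_Edin": {
        "R1": ["18983_Edin_L8_R1.fastq.gz"],
        "R2": ["18983_Edin_L8_R2.fastq.gz"]
    }
}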
So what I'm looking to do is:
- Collect and standardise the naming structure of the fastqs - move_and_rename_fastqs
- Zip any unzipped files - zip_fastqs
- Merge files that were sequenced over many lanes - merge_fastqs
- Remove fastqs that I don't need - ??
My rationale is to put all files through every rule and use the if statements to determine whether to process them or not (I moved the files into different folders in each rule to stop filename-duplication issues). I feel like I'm almost there, but I'm not quite sure how to tweak the Snakefile to parse the various {input}/{output} requirements of each rule. I'm sure there must be a simpler or more efficient way to do this - or at least a way to get this working as intended. Ideally, I would keep the same {wildcards} over the three rules, but I'm not sure if this is possible, as each sample may or may not have been sequenced over multiple lanes, and/or be zipped or unzipped.
As I said, python isn't my strong suit - indeed, this is the first time I've tried using if statements in a Snakefile - so any advice/comments/suggestions on how to solve these issues would be greatly appreciated.
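To illustrate the kind of thing I'm after, here is a rough sketch of the "one wildcard across all rules" idea, not working code - the helper function, the assumption of one original file per standardised name, and the bash test are just for illustration:

def original_fastq(wildcards):
    # look up the original location for one standardised name;
    # the key in the tsv may or may not carry a .gz suffix, so try both
    stem = wildcards.sample + ".fastq"
    return sample_df[sample_df.sampleID.isin([stem, stem + ".gz"])].fastq.tolist()

rule standardise_fastq:
    input: original_fastq
    output: SCRATCH + "/01RAW_fqs/{sample}.fastq.gz"   # e.g. sample = "1117_Crdf_L5_R1"
    shell:
        """
        if [[ "{input}" == *.gz ]]; then
            cp {input} {output}
        else
            gzip -c {input} > {output}
        fi
        """

That way, every downstream rule could rely on the same {sample} wildcard and a fixed .fastq.gz extension.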
I don't know pandas (it seems to be overkill to use this for parsing a simple two-column table), but can you add a filter so that you will only have the unzipped fastqs in sample? And then I'd add the .gz extension to those targets. Additionally, the if statement can be written more simply, without the explicit boolean comparison.
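One plausible reading of the filter and the .gz targets (the variable names here are illustrative, not necessarily the commenter's):

# keep only the fastqs that still need gzipping; the rest are already .gz
unzipped = sample_df[~sample_df.fastq.str.endswith(".gz")].sampleID
zipped = sample_df[sample_df.fastq.str.endswith(".gz")].sampleID

rule all:
    input:
        # unzipped inputs gain a .gz extension once processed
        expand(SCRATCH + "/02ZIPPED_fqs/{sample}.gz", sample=unzipped),
        expand(SCRATCH + "/02ZIPPED_fqs/{sample}", sample=zipped)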
Thanks for your suggestions. Re: the if statement - although I think both work, your way is less verbose, so I have changed it in the OP.
Both work, yes, but == True (or != True, or ...) is considered "bad practice". Just pointing it out.

Ah ... I didn't know that. Thanks for the heads up!
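For anyone reading along, the point is just that a boolean expression needs no comparison:

name = "1117_Crdf_L5_R1.fastq"
if name.endswith(".fastq") == True:   # discouraged: redundant comparison
    print("needs gzipping")
if name.endswith(".fastq"):           # idiomatic: the call already returns a boolean
    print("needs gzipping")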
Response to first post: Probably, yes. However, I was hoping to standardise the {wildcards} across the rules. Your suggestion would add more {wildcards}, but that may be the cleanest way to do it. What do you mean pandas may be overkill? What would you suggest as an alternative?

I'm not sure that it would add more wildcards. However, you need to differentiate between files that need to be gzipped and files that don't. From my experience with routing different sample types through the same snakemake workflow (I didn't have the zipped/non-zipped problem, but I have had single-end vs paired-end, and even RNA-seq vs DNA), I find it easier to process only those files that need to be processed. So far, I have had good experience with pre-sorting samples the way I proposed.
An alternative would be to just zip all the non-zipped data beforehand (non-compressed data should not exist anyway in this day and age), have this as a requirement for your workflow and then just not bother with the unzipped case.
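For instance, a one-off pass along these lines before the workflow ever runs (the databank root is a placeholder path):

find /path/to/databank -name "*.fastq" -exec gzip {} +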
Concerning pandas: I mean, all you do is read in a two-column tsv and then use that to assign the proper files to a sample. All of that could be achieved with csv.reader, an extra function, and a dict. As I said, I don't know my way around pandas, but I'd only consider using it in a context where I needed to process larger data than in your case. I'm not saying it's wrong (it isn't), it just seems unnecessary to me and adds an extra dependency in case you want to publish your workflow.
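A sketch of that plain-python alternative (assuming a single-tab-separated file; the helper name and file name are made up):

import csv

def read_name_key(path):
    # map each standardised name to the original fastq location
    name_key = {}
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            name_key[row["sampleID"]] = row["fastq"]
    return name_key

name_key = read_name_key("all_samples_nameKey.tsv")
# name_key["1117_Crdf_L5_R1.fastq.gz"] -> "dirA/1117_S7_L005_R1_001.fastq.gz"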
No probs. Fair points. Most of the snakemake walkthroughs/tutorials describe the functionality of {wildcards} in relation to a single {wildcard} that remains constant and flows through most, if not all, of the rules, i.e. {sample}. I've yet to find one that describes the best way to deal with the kind of issues I describe. I will try and implement your suggestions. Many thanks.

{sample} and {sample}.gz still use the same {sample} wildcard. In your case, {sample} is the whole filename, but in theory you could also construct a filename like /path/to/file/{sample}_R1.fastq.gz, with {sample} only storing the file prefix. It depends how you intend to process your data. In a lot of cases, when working with snakemake workflows, it really makes sense to "normalise" sample/filenames before processing in order to avoid edge cases.

I understand that. The issue is that the /path/to/file/ and filenames of the original 120 files have no consistency. This was the main reason for using pandas in the first rule: to standardise, or to use your term, normalise the sample/filenames so that I could start using the typical /path/to/file/{sample}_R1.fastq.gz you mention. All the subsequent rules I have for this analysis use this 'normalised' format. The point of the question was to attempt to incorporate this filename 'normalisation' into the Snakefile rather than doing it beforehand.
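To make the prefix idea above concrete, these are just the two flavours of pattern in play (values illustrative):

SCRATCH = "/scratch"   # placeholder

# {sample} holding the whole filename, as in the first two rules:
SCRATCH + "/02ZIPPED_fqs/{sample}"             # sample = "1117_Crdf_L5_R1.fastq.gz"

# {sample} holding only the prefix, the rest fixed by the pattern:
SCRATCH + "/03MRGD_fqs/{sample}_R1.fastq.gz"   # sample = "1117_Crdf"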