Hi,
I have sequenced DNA samples with Illumina NextSeq which produces 8 FASTQ files per sample. As a part of my snakemake
pipeline I would like to merge these files into two larger FASTQ files (forward/reverse) and continue working with these files. I have tried to figure out how this could be done with 'snakemake', so far with no success. My starting point and goal is similar to following example:
#Input files for tumor sample:
./patientAtumor-131216-42530784/patientAtumor_S1_L001_R1_001.fastq.gz
./patientAtumor-131216-42530784/patientAtumor_S1_L001_R2_001.fastq.gz
./patientAtumor-131216-42530784/patientAtumor_S1_L002_R1_001.fastq.gz
./patientAtumor-131216-42530784/patientAtumor_S1_L002_R2_001.fastq.gz
./patientAtumor-131216-42530784/patientAtumor_S1_L003_R1_001.fastq.gz
./patientAtumor-131216-42530784/patientAtumor_S1_L003_R2_001.fastq.gz
./patienttAumor-131216-42530784/patientAtumor_S1_L004_R1_001.fastq.gz
./patientAtumor-131216-42530784/patientAtumor_S1_L004_R2_001.fastq.gz
# Input files for normal (reference) sample
./patientAnormal-131216-42520895/patientAnormal_S2_L001_R1_001.fastq.gz
./patientAnormal-131216-42520895/patientAnormal_S2_L001_R2_001.fastq.gz
./patientAnormal-131216-42520895/patientAnormal_S2_L002_R1_001.fastq.gz
./patientAnormal-131216-42520895/patientAnormal_S2_L002_R2_001.fastq.gz
./patientAnormal-131216-42520895/patientAnormal_S2_L003_R1_001.fastq.gz
./patientAnormal-131216-42520895/patientAnormal_S2_L003_R2_001.fastq.gz
./patientAnormal-131216-42520895/patientAnormal_S2_L004_R1_001.fastq.gz
./patientAnormal-131216-42520895/patientAnormal_S2_L004_R2_001.fastq.gz
#Desired output would be:
./patientA/patientAtumor_R1.fastq.gz
./patientA/patientAtumor_R2.fastq.gz
./patientA/patientAnormal_R1.fastq.gz
./patientA/patientAnormal_R2.fastq.gz
# Original files can be deleted after merging is complete.
I have tried to construct two 'snakemake' rules that could complete the job but I just can't figure out how I should choose the wildcards. Especially difficult parts in input files are directory names (e.g. -131216-42520895 produced by Illumina BaseSpace) and sample ID in the FASTQ name (e.g. S1 and S2 in examples).
DIRECTORY, SAMPLES, IDS = glob_wildcards("{directory}/{sample}_S{id}_L001_R1_001.fastq.gz")
DIRECTORY, PATIENTS, IDS = glob_wildcards("{directory}/{patient}normal_S{id}_L001_R1_001.fastq.gz"
rule all:
input: expand('{patient}/{sample}_R1.fastq.gz', patient=PATIENTS, sample=SAMPLES),
expand('{patient}/{sample}_R2.fastq.gz', patient=PATIENTS, sample=SAMPLES)
rule mergeFastqR1:
input: ?
output: {patient}/{sample}_R1.fastq.gz
shell: 'cat {input} > {output}'
rule mergeFastqR2:
input: ?
output: {patient}/{sample}_R1.fastq.gz
shell: 'cat {input} > {output}'
The number of wildcards is unequal in different parts of rules which causes problems when running the pipeline. I think that the use of wildcards is the most difficult thing in 'snakemake' especially if you don't have strong background in Python. I can perform the flollowing tas with bash
but I would be more than grateful if someone could suggest how this job could be done properly with snakemake. If anyone has time to help and figure out some sort of solution, I own you my gratitude.
Thank you in advance!
Try not listing an input, just an output and then use
cat {wildcard.sample}_S*_R1.fastq.gz > {output}
in the shell.N.B., bash will expand globs (
*
) alphabetically. If you ever do something like this outside of bash, then double check that that's being done!Thanks Devon, it works like a charm! It was so easy solution that I didn't even think about it.
Could you be more specific ? What should we put as input then ? How will the mergeFastqR1 rule looks like ? thanks
Or something along those lines
Except when you have a samples with _S in the name - it will merge them. e.g. if you have samples X and X_S - them samples will end up being combined.
Sure, but since such a case wasn't present there was no reason to include it in the solution.