I am getting started with Snakemake and I have a very basic question which I couldnt find the answer in snakemake tutorial.
I want to creat a pipe that could accept multiple input files, for example, several samples with different lanes. I can create the pipe that deal with just one sample, Snakefile and config.yaml are as follows :
This could successfully deal with one sample input,but failed while add another sample "KPGP-0002". How to modify the code to process multi samples with consistent pattern?
Sorry for my english and thanks in advance.
SAMPLES = []
samp_lane = []
for samp,lane in config["samples"].items():
SAMPLES.append(samp)
samp_lane = lane
I'm guessing that should be samp_lane.append(samp) in the for loop?
Either way, the wildcards probably won't work as you expect them to. It might be considerably easier if you just merge your input files so you have a single file for each sample.
If I understand your question correctly, you are dealing with two unrelated problems. One is how to pass multiple pairs of fastq files (from the same sample but from different lanes) to bwa. The other problem is how to make snakemake accept multiple files in input linked to a single wildcard in output.
For the first issues, you can concatenate fastq files as @dr_bantz suggests but this could mean creating copies of possibly massive files. I've been dealing with this issue recently and I wrote a script to concatenate and interleave multiple pairs of fastq files which can then be piped to bwa (or anything that accepts interleaved files from stdin). Script is here catInterleaveFastq.
For the second problem I think something along these lines would work, probably it could be even simplified (see snakemake Functions as Input Files):
def get_fq1(wildcards):
# code that returns a list of fastq files for read 1 based on *wildcards.sample* e.g.
return sorted(glob.glob(wildcards.sample + '*_R1_001.fastq.gz'))
def get_fq2(wildcards):
# code that returns a list of fastq files for read 2 based on *wildcards.sample*, e.g.
return sorted(glob.glob(wildcards.sample + '*_R2_001.fastq.gz'))
rule bwa_map:
input:
fq1= get_fq1,
fq2= get_fq2,
output:
bam= '{sample}.bam'
shell:
"""
catInterleaveFastq.sh -1 {input.fq1} \
-2 {input.fq2} \
| bwa mem -p genome.fa - \
| ... possibly pipe to "samtools sort" etc \
> {output}
"""
Hope this helps...
EDIT: Add -p to bwa command to enable detection of interleaved input.
Thanks a lot! Merging all lane data into one lane is good idea and easy to implement, but i guess there may exist a problem -- if we concat all fq into one file, mapping may cost much more time. And I just think of another interesting idea -- I wrote a scripts to classify all samples into a single directory with sample name, then I could write a shell scripts to execute a Snakefile for each sample. And the final directory tree may look like this:
I'm guessing that should be samp_lane.append(samp) in the for loop?
Either way, the wildcards probably won't work as you expect them to. It might be considerably easier if you just merge your input files so you have a single file for each sample.