Hello everyone, I'm trying to do a pipeline in Snakemake but ran into a problem which I couldn't really find solution to, so I would be very happy if someone could help. In our lab we have tons and tons of patients with data which we have to analyze regularly, so I'm starting to wonder if Snakemake is actually the correct tool (even given its reproducibility). Anyways.
Say, we have patient fastqs which look like this: 111111-231228-H4_S201_L001_R1_001.fastq.gz 111111-231228-H4_S201_L001_R2_001.fastq.gz
Where 111111 - his ID;
231228 - date of analysis;
H4 - enrichment.
Sometimes we have multiple lanes, sometimes we don't.
What I want is something like: snakemake -d path/to/fastq_dir -f *231228-H4*
for the pipeline to work only with fastqs in their specific folder (named after a panel like Renal), with a specific filter in the name (in this case, only those patients who's data we obtained on 28.12.23, in enrichment 4).
I have a beautiful bash script that does this filtering for me, but I struggle to implement it in Snakemake. It filters by '231228-H4':
#main loop
for sample in $(ls *.fastq.gz | grep $filter | awk -F "_" '{print $1"_"$2}' | uniq)
do
let lanes=$(ls | grep 107328 | wc -l)/2
echo $lanes
for lane in $(seq 1 $lanes)
do
#First stage minimap2 and samtools
echo "Starting minimap2 for $sample, lane $lane"
minimap2 -t 1 -ax sr /home/maria/Documents/Reference_genomes/hg38.fa ${sample}_L00${lane}_R1_001.fastq.gz \
${sample}_L00${lane}_R2_001.fastq.gz > tmp/${sample}_pe.sam
#minimap2 -t 8 -ax sr hg38.fa *_1.fq.gz *_2.fq.gz > *.sam
echo "Converting sam to bam for $sample, lane $lane"
samtools view -@ 4 -Sb ./${sample}_pe${lane}.sam > ./${sample}_pe${lane}.bam
done
done
Basically, 'sample' in here will be something like 111111_231228_H4. Finally, here is part of my Snakefile:
configfile: "config.yaml"
rule all:
input:
"plots/quals.svg"
rule bwa_map:
input:
"/home/maria/Documents/Reference_genomes/hg38.fa",
get_bwa_map_input_fastqs
output:
temp("mapped_reads/{sample}.bam")
params:
rg=r"@RG\tID:{sample}\tSM:{sample}"
log:
"logs/bwa_mem/{sample}.log"
benchmark:
"benchmarks/{sample}.bwa.benchmark.txt"
threads: 6
shell:
"bwa mem -R '{params.rg}' -t {threads} {input} | "
"samtools view -Sb - > {output} 2> {log}"
(yeah one has bwa mem and other has minimap, but it doesn't matter for now)
You'll probably want the sample names in a sample sheet, and then in the config file have the path to the sample sheet and fastq dir. You can then use these to glob for the file matches using pathlib from within the snakefile.
You'll probably want to store everything as a dictionary with sample name as the key, and the values are lists for the R1 and R2 files.
Given that you're just learning snakemake, I'd recommend using named inputs so you can explicitly specify where each input goes. What is
get_bwa_map_input_fastqs
? Look into theexpand()
function like shown here: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#the-expand-function