Question

Giving filtered input fastqs to Snakemake

0

Entering edit mode

23 months ago

mary.v.volkova ▴ 20

Hello everyone, I'm trying to do a pipeline in Snakemake but ran into a problem which I couldn't really find solution to, so I would be very happy if someone could help. In our lab we have tons and tons of patients with data which we have to analyze regularly, so I'm starting to wonder if Snakemake is actually the correct tool (even given its reproducibility). Anyways.

Say, we have patient fastqs which look like this: 111111-231228-H4_S201_L001_R1_001.fastq.gz 111111-231228-H4_S201_L001_R2_001.fastq.gz

Where 111111 - his ID; 231228 - date of analysis; H4 - enrichment. Sometimes we have multiple lanes, sometimes we don't. What I want is something like: snakemake -d path/to/fastq_dir -f *231228-H4* for the pipeline to work only with fastqs in their specific folder (named after a panel like Renal), with a specific filter in the name (in this case, only those patients who's data we obtained on 28.12.23, in enrichment 4).

I have a beautiful bash script that does this filtering for me, but I struggle to implement it in Snakemake. It filters by '231228-H4':

#main loop
for sample in $(ls *.fastq.gz | grep $filter | awk -F "_" '{print $1"_"$2}' | uniq)
do

 let lanes=$(ls | grep 107328 | wc -l)/2
 echo $lanes

 for lane in $(seq 1 $lanes)
 do

    #First stage minimap2 and samtools
    echo "Starting minimap2 for $sample, lane $lane"
    minimap2 -t 1 -ax sr /home/maria/Documents/Reference_genomes/hg38.fa ${sample}_L00${lane}_R1_001.fastq.gz \
    ${sample}_L00${lane}_R2_001.fastq.gz > tmp/${sample}_pe.sam
    #minimap2 -t 8 -ax sr hg38.fa *_1.fq.gz *_2.fq.gz > *.sam
    echo "Converting sam to bam for $sample, lane $lane"
    samtools view -@ 4 -Sb ./${sample}_pe${lane}.sam > ./${sample}_pe${lane}.bam 

 done

done

Basically, 'sample' in here will be something like 111111_231228_H4. Finally, here is part of my Snakefile:

configfile: "config.yaml"

rule all:
    input:
        "plots/quals.svg"


rule bwa_map:
    input:  
        "/home/maria/Documents/Reference_genomes/hg38.fa",
        get_bwa_map_input_fastqs
    output: 
        temp("mapped_reads/{sample}.bam")
    params:
        rg=r"@RG\tID:{sample}\tSM:{sample}"
    log:
        "logs/bwa_mem/{sample}.log"
    benchmark:
        "benchmarks/{sample}.bwa.benchmark.txt"
    threads: 6
    shell: 
        "bwa mem -R '{params.rg}' -t {threads} {input} | "
        "samtools view -Sb - > {output} 2> {log}"

(yeah one has bwa mem and other has minimap, but it doesn't matter for now)

fastq pipeline bwa python snakemake • 1.2k views

ADD COMMENT • link updated 23 months ago by rpolicastro 13k • written 23 months ago by mary.v.volkova ▴ 20

0

Entering edit mode

You'll probably want the sample names in a sample sheet, and then in the config file have the path to the sample sheet and fastq dir. You can then use these to glob for the file matches using pathlib from within the snakefile.

You'll probably want to store everything as a dictionary with sample name as the key, and the values are lists for the R1 and R2 files.

ADD REPLY • link 23 months ago by rpolicastro 13k

0

Entering edit mode

Given that you're just learning snakemake, I'd recommend using named inputs so you can explicitly specify where each input goes. What is get_bwa_map_input_fastqs? Look into the expand() function like shown here: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#the-expand-function

ADD REPLY • link 23 months ago by Ram 45k