Executing snakemake array jobs on a cluster
Ivan ▴ 60 · 2.8 years ago

I have a list of SRA accession numbers that I download using sratoolkit's fasterq-dump. Since I have a number of samples, instead of downloading them serially, I take advantage of array jobs. The script I use is the following:

#!/bin/bash
#$ -cwd
#$ -V
#$ -t 1-44
# Read the first CSV column (minus the header row) into a bash array
read -a samples <<< $(cut -d , -f 1 linker.csv | tail -n +2)
# SGE task IDs start at 1, bash arrays at 0
false_index=$SGE_TASK_ID
true_index=$((false_index-1))
sample=${samples[$true_index]}
fasterq-dump "$sample" -O raw_samples >> downloading.log

The short version: I read the files to download into a bash array. The -t flag requests an array job with task IDs 1 through 44; each task reads its ID from the $SGE_TASK_ID variable, subtracts one to index into the array, and runs fasterq-dump on the resulting sample. This script works wonders.

But I want to implement this in a snakemake pipeline, and to make that pipeline run on the cluster I need to use its --cluster flag. I've managed to write a snakemake rule that downloads the files, but only serially; I have no idea how I'd implement parallelization (technically, an array job). The snakemake script is below:

import csv

# Map SRA accession (column 1) to sample name (column 3) in linker.csv.
# The header row is read in as well; it is dropped below.
def read_dictionary():
    with open("linker.csv") as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=",")
        dic = {row[0]: row[2] for row in csv_reader}
    return dic

SRA_MAPPING = read_dictionary()
SRAFILES = list(SRA_MAPPING.keys())[1:]  # drop the header entry

rule download_files:
    output:
        "raw_samples/download.log"
    run:
        # Serial download: one fasterq-dump call per accession
        for file in SRAFILES:
            shell("fasterq-dump {file} -O raw_samples >> {output}")

In a nutshell, it reads the samples into the global variable SRAFILES, and the rule's run block loops over that list, calling fasterq-dump on each file. How would I implement "parallelization" of this one job/rule?

So SRAFILES is a list of SRR IDs, right? If so, I would make the SRR id a wildcard and let snakemake handle the for-loop. E.g.:

SRAFILES = ['SRR123', 'SRR456']

rule all:
    input:
        expand('raw_samples/{srr}.log', srr= SRAFILES),

rule download_files:
    output:
        xlog= 'raw_samples/{srr}.log',
    shell:
        r"""
        fasterq-dump {wildcards.srr} -O raw_samples > {output.xlog}
        """

Then execute with the -j option set to a reasonable number of jobs to run in parallel:

snakemake -j 10 -p --cluster "qsub ..."
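
For example, reusing the flags from your original array script (the exact qsub options depend on your site's setup, so treat this as an assumption rather than a recipe):

snakemake -j 10 -p --cluster "qsub -V -cwd"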

Basically, you are asking snakemake to produce one log file per SRR ID, and that log file is produced by rule download_files. As a side product, download_files will give you the actual fastq files (things could be done differently in this respect, but hopefully this will help...)
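
If you would rather have snakemake track the fastq files themselves instead of a log, here is a minimal sketch of the same rule (untested, and assuming paired-end data so that fasterq-dump writes {srr}_1.fastq and {srr}_2.fastq):

rule download_files:
    output:
        # Assumed paired-end naming; single-end runs produce {srr}.fastq instead
        r1= 'raw_samples/{srr}_1.fastq',
        r2= 'raw_samples/{srr}_2.fastq',
    shell:
        r"""
        fasterq-dump {wildcards.srr} -O raw_samples
        """

rule all would then expand over the fastq names instead of the log files.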


I will try that, just to see if it's done in parallel. Before I try it out: I know that my list has, let's say, 10 elements. Could I implement -j 10 to specify that I want 10 parallel tasks to be done? And for future-proofing, if I have a number of rules but only some require parallel processing, could I specify which rule should use how many threads? I guess what I'm asking is whether there is a way to transfer "qsub -t 1-10" to the rules that require parallel processing.

EDIT: That worked, thank you. For posterity, my full command line was: snakemake --jobs 1 --cluster "qsub -V -cwd -now y"


Ivan, look into the resources directive to manage the number of jobs, memory usage, etc. I haven't used it much, but you should have quite a bit of flexibility there. If you get stuck, post here...

Could I implement -j 10 to specify that I want 10 parallel tasks to be done?

Yes, that will have snakemake run at most 10 parallel jobs at a time. Note that --jobs 1 effectively disables parallelization.
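
For the per-rule threads question, a sketch along these lines should work (untested; -pe smp is an assumption about your SGE parallel environment name, while -e/--threads is a real fasterq-dump option):

rule download_files:
    output:
        xlog= 'raw_samples/{srr}.log',
    threads: 4
    shell:
        r"""
        fasterq-dump {wildcards.srr} -e {threads} -O raw_samples > {output.xlog}
        """

The cluster command can then pick up the thread count per rule:

snakemake -j 10 -p --cluster "qsub -V -cwd -pe smp {threads}"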
