I have a list of SRA accession numbers that I download using sratoolkit's fasterq-dump. Since I have a number of samples, instead of downloading them serially, I take advantage of array jobs. The script I use is the following:
#!/bin/bash
#$ -cwd
#$ -V
#$ -t 1-44

# Read the accession numbers (first CSV column, header skipped) into an array.
read -a samples <<< $(cut -d , -f 1 linker.csv | tail -n +2)

# SGE task IDs start at 1, bash array indices at 0.
false_index=$SGE_TASK_ID
true_index=$((false_index-1))
sample=${samples[$true_index]}

fasterq-dump $sample -O raw_samples >> downloading.log
The short version is that I read the accessions to download into an array. I specify the range of array tasks with the -t flag. Every task gets an id (the $SGE_TASK_ID variable) from 1 to 44. Based on that id, a sample is retrieved via indexing, and then I run fasterq-dump to download it. This script works wonders.
But I want to implement it in a snakemake pipeline, and in order to make that pipeline work on the cluster, I need to use its --cluster flag. I've managed to write one snakemake rule to download the files, but only serially. I have no idea how I'd implement parallelization (technically, an array job). The snakemake script is below:
import csv

# Build a dict mapping the first CSV column (accession) to the third.
def read_dictionary():
    with open("linker.csv") as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=",")
        dic = {row[0]: row[2] for row in csv_reader}
    return dic

SRA_MAPPING = read_dictionary()
SRAFILES = list(SRA_MAPPING.keys())[1:]  # drop the header row

rule download_files:
    output:
        "raw_samples/download.log"
    run:
        for file in SRAFILES:
            #shell("touch raw_samples/{file} >> raw_samples/download.log")
            shell("fasterq-dump {file} -O raw_samples >> {output}")
In a nutshell, it reads the samples into the global variable SRAFILES, and the run block then loops over that variable, calling fasterq-dump on each file. How would I implement "parallelization" of one job/rule?
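For reference, the usual Snakemake approach here is to make the sample a wildcard, so that each download becomes its own job and --jobs/--cluster can schedule them independently. A minimal sketch, reusing SRAFILES from above; the single-end {sample}.fastq output name is an assumption (paired-end runs would produce {sample}_1.fastq / {sample}_2.fastq):

# Sketch: one job per accession, so "snakemake --jobs N" can run N at once.
rule all:
    input:
        expand("raw_samples/{sample}.fastq", sample=SRAFILES)

rule download_file:
    output:
        "raw_samples/{sample}.fastq"
    shell:
        "fasterq-dump {wildcards.sample} -O raw_samples"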
I will try that, just to see if it's done in parallel. Before I try it out: I know that my list has, let's say, 10 elements. Could I pass -j 10 to specify that I want 10 parallel tasks done? And for future-proofing, if I have a number of rules but only some require parallel processing, could I specify which rule should use how many threads? I guess what I'm asking is whether there is a way to transfer "qsub -t 1-10" to the rules that require parallel processing.

EDIT: That worked, thank you. For posterity, my full command line was:

snakemake --jobs 1 --cluster "qsub -V -cwd -now y"
Ivan, look into the resources directive to manage the number of jobs, memory usage etc. I haven't used it much but you should have quite a bit of flexibility there. If you get stuck, post here...
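For example, threads and resources are declared per rule and reserved by the scheduler; a sketch with placeholder numbers (fasterq-dump's -e flag sets its thread count):

rule download_file:
    output:
        "raw_samples/{sample}.fastq"
    threads: 4              # upper bound; scaled down if fewer cores are available
    resources:
        mem_mb=2000         # placeholder value
    shell:
        "fasterq-dump {wildcards.sample} -O raw_samples -e {threads}"

When submitting with --cluster, these values can be forwarded, e.g. --cluster "qsub -pe smp {threads}" (the parallel environment name is site-specific, so treat "smp" as an assumption).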
Yes, that will have snakemake run at most 10 parallel jobs. --jobs 1 effectively disables parallelization.
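For anyone copying this later: combined with a per-sample rule, raising --jobs is what actually unlocks the parallel downloads, e.g. with the same qsub options as above:

snakemake --jobs 10 --cluster "qsub -V -cwd -now y"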