Running fastqc in parallel
3 months ago
pablo ▴ 310

Hi,

I have 4 directories with 4 fastq files each. I need to run fastqc on them at the same time. I use:

sbatch -p node --mem=32G --ntasks 16 --cpus-per-task=8 --wrap="find . -type d -name data -prune -o -name '*.fastq.gz' -print | parallel -j 16 fastqc -t 8 -o {//}/ {}"

I normally have 16 parallel tasks because of the 16 fastq files, and I want to allocate 8 CPUs per task (I do have enough resources to do it). Does that work properly? When I look at the top command, I see my 16 "java" jobs with the %CPU value stuck at 100%. Shouldn't each be at 800%, given the 8 CPUs per task?

Best

quality fastqc • 1.1k views
ADD COMMENT

You're right, I'll try that later. But if you could help me solve my sbatch issue, that would be great.

ADD REPLY

Submit individual FastQC jobs with sbatch via a for loop. You have a bona fide job scheduler (SLURM) available, so there is no need to use parallel.
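
A minimal sketch of what that loop could look like, assuming the same file selection as the find command in the question. The partition name "node" comes from the question; the 4G memory request is a placeholder, and SBATCH_CMD is a hypothetical variable added here only so the loop can be dry-run with echo. Each job asks for one CPU because FastQC processes each file on a single thread.

```shell
#!/usr/bin/env bash
# Sketch: submit one SLURM job per fastq file instead of using GNU parallel.
# Assumptions (not from the thread): --mem=4G is a placeholder value, and
# SBATCH_CMD is a hypothetical override, e.g. SBATCH_CMD=echo for a dry run.
# Paths containing single quotes would break the --wrap string below.

submit_fastqc_jobs() {
    local root="${1:-.}"
    local submit="${SBATCH_CMD:-sbatch}"
    local fq
    # Skip any directory named "data", list every *.fastq.gz elsewhere.
    find "$root" -type d -name data -prune -o -name '*.fastq.gz' -print |
    while IFS= read -r fq; do
        # One job per file; the report is written next to the input file.
        $submit -p node --mem=4G --cpus-per-task=1 \
            --wrap="fastqc -o '$(dirname "$fq")' '$fq'"
    done
}
```

Calling `SBATCH_CMD=echo submit_fastqc_jobs .` prints the arguments each submission would receive; calling `submit_fastqc_jobs .` submits the jobs for real and lets SLURM decide how many run at once.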

ADD REPLY
3 months ago
BioinfGuru ★ 2.1k

Hi Pablo,

I'm not familiar with submitting jobs to SLURM, but is there a reason why you can't just use the -t option of fastqc without parallel? The following code snippet from my qc bash script runs fastqc (on my laptop) on all fastq files in the directory, processing 15 fastq files simultaneously until completion. Change the 15 to a lower or higher number as allowed by your machine capacity. When running on a new machine, I'll usually start with 2 to see how fast it runs, then increase and repeat until the machine screams at me; then I know the max value for -t.

# Store input and output directory paths in shell variables
dir_in="path/to/fastq/files"
dir_out="path/to/directory/for/storing/fastqc/output"

# Prepare output directory (mkdir -p does nothing if it already exists)
mkdir -p "${dir_out}"

# Store the full fastq file paths in a shell variable
# (relies on word splitting, so it breaks on paths containing spaces)
raw_fastq_files=$(ls ${dir_in}/*fastq.gz)

# WRONG: Don't do this (for reasons explained by @rfran010 below)
#   for i in ${raw_fastq_files}; do
#      fastqc "$i" -t 15 -q -o "${dir_out}"  # -t (multi-threading), -q (quiet), -o (output directory)
#   done

# CORRECT: Run fastqc with multi-threading on up to 15 fastq files simultaneously
fastqc -t 15 -q -o "${dir_out}" ${raw_fastq_files}
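
As an alternative to raising -t until the machine complains, here is a rough sketch that caps the value at whichever is smaller: the number of files or the number of cores. pick_threads is a hypothetical helper, not part of FastQC, and it assumes nproc (GNU coreutils) is available when no core count is passed. It ignores memory, which also grows with each extra thread.

```shell
# Hypothetical helper: choose a value for fastqc -t.
# FastQC handles one file per thread, so asking for more threads than files
# gains nothing, and asking for more threads than cores oversubscribes the CPU.
pick_threads() {
    local n_files=$1
    local n_cores=${2:-$(nproc)}   # default: core count reported by nproc
    if [ "$n_files" -lt "$n_cores" ]; then
        echo "$n_files"
    else
        echo "$n_cores"
    fi
}

pick_threads 10 16   # prints 10
pick_threads 30 16   # prints 16
```

The result can then be passed straight to fastqc, for example `-t "$(pick_threads 10)"` for 10 input files.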
ADD COMMENT

Sorry to butt in, but isn't the for-loop plus multi-threading redundant and maybe inefficient for fastqc?

In your example, it looks like you allocate 15 threads per file in the fastq list. So 14 threads are not in use since you are running one file at a time in the for loop. Am I missing something?

Alternatively, you can use the multithreading mode directly (for loop not necessary):

    fastqc -t 15 -q -o ${dir_out} ${raw_fastq_files}

This will launch fastqc on up to 15 files at a time. So if your fastq list has 15 files, then it will run fastqc on 15 files in parallel, taking essentially the same time as if running fastqc without multithreading on 1 file. If the list has 30 files, then it will run on 15 in parallel and then start the next files after each run completes, so essentially the time it takes to run fastqc on 2 files.
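
The arithmetic behind that estimate can be sketched as a ceiling division. This is an idealized model that assumes files of similar size and at least as many free cores as threads; fastqc_waves is a hypothetical helper name.

```shell
# Idealized estimate: how many "single-file run times" a fastqc -t run takes.
# waves = ceil(n_files / threads), since each thread handles one file at a time.
fastqc_waves() {
    local n_files=$1 threads=$2
    echo $(( (n_files + threads - 1) / threads ))
}

fastqc_waves 15 15   # prints 1
fastqc_waves 30 15   # prints 2
fastqc_waves 1 8     # prints 1
```

The last call is the situation in the original question: with a single input file, -t 8 still takes one wave, and only one thread is busy, which matches the 100% shown by top.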

ADD REPLY

Wow. You are spot on. I completely missed that. Thank you!

Test results:

  • Using the for loop completes 1 fastq at a time, and even though 15 threads are allocated, only 2 are ever used. Run time: 21 min on 10 fastq files
  • Using your code (aka "doing it properly") runs 15 fastq files simultaneously, 15 threads used. Run time: 7 min on the same fastq files

That's a big oops!

ADD REPLY

That works pretty well, thanks.

ADD REPLY

You are welcome.

ADD REPLY
