Question

parallel fastQC computation

1

10.2 years ago

dolevrahat ▴ 40

Hello

I want to parallelize a fastQC computation of a large fastq file.

I know that FastQC can process several files in parallel with the -t option, but I'm not sure the result will still be correct if I split the file into several smaller files.

Is there any way to do this?

Thanks in advance
Dolev Rahat

fastqc • 7.7k views

ADD COMMENT • link updated 3.8 years ago by Ram 44k • written 10.2 years ago by dolevrahat ▴ 40

0

Hi,

I would say split the fastq file into 2 files and run fastqc with the -t option. That should work fine, though I haven't tested it.
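A minimal sketch of the splitting step, not part of the original reply. It assumes an uncompressed FASTQ file with exactly four lines per record (no wrapped sequences); the function names and the `.partN.fastq` naming are hypothetical. Records are dealt out round-robin so that no record is cut in half.

```python
from itertools import islice


def read_fastq(handle):
    """Yield FASTQ records as 4-line tuples (header, sequence, plus, quality)."""
    while True:
        record = tuple(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def split_fastq(in_path, out_prefix, n_parts=2):
    """Deal records round-robin into n_parts files and return their paths."""
    paths = [f"{out_prefix}.part{i}.fastq" for i in range(n_parts)]
    outputs = [open(p, "w") for p in paths]
    try:
        with open(in_path) as handle:
            for i, record in enumerate(read_fastq(handle)):
                # Record i goes to file i mod n_parts, so parts stay balanced.
                outputs[i % n_parts].writelines(record)
    finally:
        for out in outputs:
            out.close()
    return paths
```

The parts could then be passed to something like `fastqc -t 2 reads.part0.fastq reads.part1.fastq`. FastQC writes one report per input file, so this yields two separate reports rather than one combined report.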

ADD REPLY • link updated 3.8 years ago by Ram 44k • written 10.2 years ago by Varun Gupta ★ 1.3k

0

To my knowledge, FastQC operates on a sub-sample of your input file for its analyses; there's no need to actually process all reads for summary stats.

ADD REPLY • link updated 3.8 years ago by Ram 44k • written 10.2 years ago by Ben ★ 2.0k

Ram · Accepted Answer · 2014-10-27

3

10.2 years ago

Istvan Albert 102k

If you split the files into small parts then FastQC will run on each separately and the results are not combined.

As for an answer to your question: the problem (as always) is more complicated than it looks. In a perfect world, splitting the file into random subsets would produce similar results. But splitting into random subsets is not all that easy. In addition, the way FastQC works is that some operations only collect information from the first 200K or so reads but then track that information throughout the file.

Often, if the data is not skewed in particular ways, you can get by with analyzing a small subset of the file.

In other cases (as I have painfully learned myself) subsetting a dataset can produce reports that do not even remotely resemble the original data. But that had nothing to do with the way FastQC works.

Long story short: I would run FastQC on all the data, then on a subset of it, and check whether the library prep and the subject under study support subsetting.
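One way to draw the random subset for such a comparison is reservoir sampling, which picks k records uniformly at random in a single pass without loading the whole file. The sketch below is an illustration, not something from the original answer; it assumes four-line FASTQ records, and `read_fastq` and `reservoir_sample` are hypothetical helper names.

```python
import random
from itertools import islice


def read_fastq(handle):
    """Yield FASTQ records as 4-line tuples (header, sequence, plus, quality)."""
    while True:
        record = tuple(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def reservoir_sample(records, k, seed=0):
    """Return up to k records chosen uniformly at random (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = record
    return reservoir
```

Note that the records in the reservoir do not keep their original file order, and memory use grows with k rather than with the file size.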

ADD COMMENT • link updated 3.8 years ago by Ram 44k • written 10.2 years ago by Istvan Albert 102k

0

Thank you for your insightful answer. The thing is that once I have run FastQC on the entire dataset, I have no reason to run it on a subset of that dataset, because I already know the characteristics of the dataset. Or am I missing something?

ADD REPLY • link updated 3.8 years ago by Ram 44k • written 10.2 years ago by dolevrahat ▴ 40

0

The idea is that if you have many similar samples, only one would need to be fully processed.

ADD REPLY • link updated 3.8 years ago by Ram 44k • written 10.2 years ago by Istvan Albert 102k

0

In my experience, a number of the first reads are of really bad quality. Basically, one should never assume that a "first N reads" subset is the same as "random N reads" in terms of parameters such as quality.
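A quick way to check this on your own data is to compare the mean base quality of the first N reads against N randomly chosen reads. The sketch below is only an illustration: it assumes Phred+33 encoded quality strings that fit in memory, and `mean_quality` and `first_vs_random` are hypothetical names.

```python
import random


def mean_quality(quality_strings, offset=33):
    """Mean Phred score over all bases, assuming Phred+33 encoding."""
    total = 0
    count = 0
    for qual in quality_strings:
        total += sum(ord(ch) - offset for ch in qual)
        count += len(qual)
    return total / count if count else float("nan")


def first_vs_random(quality_strings, n, seed=0):
    """Return (mean quality of first n reads, mean quality of n random reads)."""
    quals = list(quality_strings)
    first = quals[:n]
    # Sample without replacement from the whole set of reads.
    rand = random.Random(seed).sample(quals, min(n, len(quals)))
    return mean_quality(first), mean_quality(rand)
```

A large gap between the two numbers suggests that the head of the file is not representative of the whole run.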

ADD REPLY • link updated 3.8 years ago by Ram 44k • written 10.2 years ago by mikhail.shugay 3.5k

0

I agree, it is best to subsample. Note, though, that most of these tools work in such a way that reads keep their original order; the sampler just skips some entries. Properly shuffling the file, on the other hand, would be more time consuming than running FastQC on it.
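The "skip some entries" behaviour can be sketched as a streaming sampler that keeps each record with a fixed probability. This is an illustration of the general idea, not the implementation of any particular tool; `subsample_in_order` is a hypothetical name.

```python
import random


def subsample_in_order(records, fraction, seed=0):
    """Yield each record with probability `fraction`, preserving input order."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    for record in records:
        # Independent coin flip per record; skipped records are simply dropped.
        if rng.random() < fraction:
            yield record
```

Because the output follows the input order, any position-dependent effect (for example, poor quality at the start of the run) shows up in the same relative place in the subsample. The number of records kept is only approximately `fraction` times the input size.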

ADD REPLY • link updated 3.8 years ago by Ram 44k • written 10.2 years ago by Istvan Albert 102k