Question

Help with clumpify.sh

0

3.0 years ago

tamu.anand ▴ 30

Hi all,

I have a question about clumpify.sh usage.

My goal: I am trying to run clumpify.sh as the very first step of my RNA-seq/WES/WGS pipeline, based on the two points below (as listed by Brian here). My thinking is that if I start with clumped reads as Step 1 of the pipeline, the downstream steps will benefit from the reduced file sizes, which may speed up the pipeline.

Clumpify has no effect on downstream analysis aside from making it faster
If you want to clumpify data for compression, do it as early as possible (e.g. on the raw reads). Then run all downstream processing steps ensuring that read order is maintained

I want to ensure that none of the downstream steps in my pipeline are affected in any way. Hence, I tried out clumpify.sh and compared fastp results with and without it.

Case Study 1

fastp on the original reads (no clumpify pre-processing)

Case Study 2

clumpify.sh in1=R1.fastq.gz in2=R2.fastq.gz out1=clumped_R1.fastq.gz out2=clumped_R2.fastq.gz reorder=p  
followed by fastp on the clumped reads

Observation: When I look at the fastp statistics, there are very minute differences.

Fastp results - Case Study 1

After filtering
total reads:    149.178736 M
total bases:    15.023535 G
Q20 bases:  14.815694 G (98.616568%)
Q30 bases:  14.394220 G (95.811144%)
GC content: 46.186597%
Filtering result
reads passed filters:   149.178736 M (95.630291%)
reads with low quality: 6.227876 M (3.992349%)
reads with too many N:  7.686000 K (0.004927%)
reads too short:    580.978000 K (0.372433%)

Fastp results - Case Study 2

After filtering
total reads:    149.174956 M
total bases:    15.022542 G
Q20 bases:  14.814667 G (98.616246%)
Q30 bases:  14.393230 G (95.810881%)
GC content: 46.186565%
Filtering result
reads passed filters:   149.174956 M (95.627868%)
reads with low quality: 6.228374 M (3.992668%)
reads with too many N:  7.688000 K (0.004928%)
reads too short:    584.258000 K (0.374536%)

The question: Given the minute differences laid out above, is there anything I should be worried about or look out for downstream?

Thanks in advance.

fastp BBTools clumpify • 2.4k views

ADD COMMENT • link updated 2.4 years ago by Darked89 4.7k • written 3.0 years ago by tamu.anand ▴ 30

1

The only reason you would want to use clumpify.sh is if you are interested in removing sequence duplicates in an alignment-free manner. Otherwise you are not likely to get a lot of benefit from this procedure.

ADD REPLY • link 3.0 years ago by GenoMax 153k

0

Thanks Brian, GenoMax and ATpoint

ADD REPLY • link 3.0 years ago by tamu.anand ▴ 30

1

3.0 years ago

ATpoint 89k

I think you are way overthinking this. You are investing processing time in better compression, but every downstream step, including fastp and the alignment, has to decompress everything all over again anyway, because these tools work on the plain-text content rather than directly on compressed data. I see no benefit in this. If anything, I would guess it slows things down because of the additional time needed to run Clumpify. Just do what everyone does and use the FASTQ files right away. If you want to reduce storage space, consider storing the FASTQ files as unmapped CRAM.

ADD COMMENT • link 3.0 years ago by ATpoint 89k

score 3 · Accepted Answer · 2022-08-08

If you keep the files decompressed, there is no reason to run Clumpify unless you turn on deduplication or error-correction. However, if you keep the files gzip-compressed, Clumpify will result in much smaller files for RNA-seq, and a side-effect is that they will compress and decompress faster and be more cache-friendly (and thus faster in some applications) since co-located reads will be contiguous.
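One way to see how much smaller the clumped files are on a given dataset is to compare the on-disk sizes of the original and clumped gzip files. The following is only a rough sketch: `size_saving` is a hypothetical helper name, the file names are the placeholders from the command in the question, and the result is an integer percentage.

```shell
#!/usr/bin/env bash
# Sketch: report how much smaller a clumped gzip file is than the original.
# size_saving is a hypothetical helper, not part of BBTools.

size_saving() {
    local original="$1" clumped="$2"
    local a b
    a=$(wc -c < "$original")   # compressed size of the original, in bytes
    b=$(wc -c < "$clumped")    # compressed size of the clumped file, in bytes
    # Integer percentage saved relative to the original file.
    echo $(( (a - b) * 100 / a ))
}

# Usage, assuming both files exist:
# size_saving R1.fastq.gz clumped_R1.fastq.gz
```

The saving will vary with library type and depth, so it is worth measuring on your own data rather than assuming a figure.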

Clumpify itself takes time so unless you have multiple subsequent steps which are rate-limited by compression or decompression, you won't usually get a net speed benefit. In our preprocessing pipelines we have 5-8 steps, each of which decompresses and compresses the entire intermediate file (to save either network bandwidth or local disk, depending on the node) and ultimately running Clumpify first is a net time saver even when deduplication is disabled; for only a single downstream step, Clumpify would usually be a net time loss - but don't underestimate the cost of memory random access and cache locality. To save long-term storage costs we would Clumpify the output anyway before writing to tape, so it's best to do it as the first step and get the speed benefit in every pipeline stage.

The minor differences in fastp statistics are not due to Clumpify changing the content, because with this command, the only thing that changes is the order of the reads. Rather, fastp is probably subsampling for speed reasons, so it's gathering statistics from a different subset of reads, depending on the order. A more interesting result would be the output of 'time' when used with fastp in each case.
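If you want to confirm for yourself that only the read order changed, one possible check is an order-independent checksum of each file before and after Clumpify. This is a minimal sketch, assuming standard 4-line FASTQ records and the availability of `gzip`, `paste`, `sort` and `md5sum`; `fastq_content_hash` is a hypothetical helper name, and the file names are the placeholders from the question.

```shell
#!/usr/bin/env bash
# Sketch: order-independent checksum of a gzipped FASTQ file.
# Assumes standard 4-line FASTQ records (no wrapped sequence lines).

fastq_content_hash() {
    # Join each 4-line record into one tab-separated line, sort the records
    # bytewise so that read order no longer matters, then hash the result.
    gzip -dc "$1" | paste - - - - | LC_ALL=C sort | md5sum | cut -d' ' -f1
}

# Usage, assuming both files exist; the two hashes should be identical
# if Clumpify only reordered the reads:
# fastq_content_hash R1.fastq.gz
# fastq_content_hash clumped_R1.fastq.gz
```

Sorting a file of this size needs a fair amount of temporary disk space, and the check treats R1 and R2 independently, so it does not verify that pairs stayed in sync. For the timing comparison, prefixing each fastp run with the shell's `time` keyword should be enough.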