Hi all,
I have a question about clumpify.sh usage.
My goal: I am trying to run clumpify.sh as the very first step of my RNASeq/WES/WGS pipeline, based on the two points below (as listed by Brian here). My thinking is that if I start with clumped reads as step 1, the downstream steps will benefit from the reduced file sizes and the pipeline may run faster.
- Clumpify has no effect on downstream analysis aside from making it faster
- If you want to clumpify data for compression, do it as early as possible (e.g. on the raw reads). Then run all downstream processing steps ensuring that read order is maintained
I want to ensure that none of the downstream steps in my pipeline are affected in any way. Hence, I tried clumpify.sh and compared fastp results with and without it.
Case Study 1
fastp on the original reads (no clumpify pre-processing)
Case Study 2
clumpify.sh in1=R1.fastq.gz in2=R2.fastq.gz out1=clumped_R1.fastq.gz out2=clumped_R2.fastq.gz reorder=p
followed by fastp on the clumped reads
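One way to confirm that clumpify.sh only reordered the reads, before comparing fastp output, is to check that the original and clumped files contain the same set of read IDs. Below is a minimal sketch under some assumptions: `read_id_checksum` and `compare_read_ids` are hypothetical helper names, the FASTQ files use plain 4-line records, and sorting all IDs is affordable for the file size at hand. It does not compare sequences or qualities, only IDs.

```shell
#!/bin/sh
# read_id_checksum (hypothetical helper): checksum of the sorted read IDs
# in a gzipped FASTQ. Assumes 4-line records (no wrapped sequences).
read_id_checksum() {
  gzip -dc "$1" | awk 'NR % 4 == 1 { print $1 }' | LC_ALL=C sort | cksum
}

# compare_read_ids (hypothetical helper): report whether two files hold
# the same set of read IDs, regardless of order.
compare_read_ids() {
  if [ "$(read_id_checksum "$1")" = "$(read_id_checksum "$2")" ]; then
    echo "same read IDs: $1 vs $2"
  else
    echo "read IDs differ: $1 vs $2"
  fi
}

# Usage with the file names from the post (skipped if the files are absent).
if [ -f R1.fastq.gz ] && [ -f clumped_R1.fastq.gz ]; then
  compare_read_ids R1.fastq.gz clumped_R1.fastq.gz
fi
if [ -f R2.fastq.gz ] && [ -f clumped_R2.fastq.gz ]; then
  compare_read_ids R2.fastq.gz clumped_R2.fastq.gz
fi
```

If the IDs match, any difference in the fastp reports would have to come from how the reads are processed rather than from reads being added or lost.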
Observation: When I look at the fastp statistics, there are very minute differences.
Fastp results - Case Study 1
After filtering
total reads: 149.178736 M
total bases: 15.023535 G
Q20 bases: 14.815694 G (98.616568%)
Q30 bases: 14.394220 G (95.811144%)
GC content: 46.186597%
Filtering result
reads passed filters: 149.178736 M (95.630291%)
reads with low quality: 6.227876 M (3.992349%)
reads with too many N: 7.686000 K (0.004927%)
reads too short: 580.978000 K (0.372433%)
Fastp results - Case Study 2
After filtering
total reads: 149.174956 M
total bases: 15.022542 G
Q20 bases: 14.814667 G (98.616246%)
Q30 bases: 14.393230 G (95.810881%)
GC content: 46.186565%
Filtering result
reads passed filters: 149.174956 M (95.627868%)
reads with low quality: 6.228374 M (3.992668%)
reads with too many N: 7.688000 K (0.004928%)
reads too short: 584.258000 K (0.374536%)
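To put numbers on the differences, the counts from the two "Filtering result" blocks above can be summed and subtracted. This is a small sketch using awk; the helper names `sum_counts` and `category_deltas` are hypothetical, and the counts are simply the reported values written out in full (M = 10^6, K = 10^3).

```shell
#!/bin/sh
# Counts copied from the two fastp reports above, in this order:
# passed filters, low quality, too many N, too short.
case1="149178736 6227876 7686 580978"   # fastp on the original reads
case2="149174956 6228374 7688 584258"   # fastp on the clumpified reads

# sum_counts (hypothetical helper): total of the whitespace-separated counts.
sum_counts() {
  echo "$1" | awk '{ s = 0; for (i = 1; i <= NF; i++) s += $i; printf "%d\n", s }'
}

# category_deltas (hypothetical helper): per-category difference,
# second argument minus first argument.
category_deltas() {
  echo "$1 $2" | awk '{
    n = NF / 2
    for (i = 1; i <= n; i++) printf "%d%s", $(i + n) - $i, (i < n ? " " : "\n")
  }'
}

echo "input reads, case 1: $(sum_counts "$case1")"
echo "input reads, case 2: $(sum_counts "$case2")"
echo "deltas (passed lowQ tooManyN tooShort): $(category_deltas "$case1" "$case2")"
```

Both totals come out to 155995276, so the same number of reads went into fastp in each case. The 3780 fewer reads passing in Case Study 2 are accounted for by 498 more low-quality reads, 2 more reads with too many N, and 3280 more reads classed as too short.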
The question: Given the small differences laid out above, is there anything I should be worried about or look out for downstream?
Thanks in advance.
The only reason you would want to use clumpify.sh is if you are interested in removing sequence duplicates in an alignment-free manner. Otherwise you are not likely to get a lot of benefit from this procedure.
Thanks Brian, GenoMax and ATpoint