Hello, I am working with whole-metagenome shotgun data, specifically paired-end reads, where each sample typically has a read depth of approximately 50-60 million reads. In some samples, particularly those with a low-complexity microbiome, I've observed high levels of duplication. For instance, in one sample (sample_X), the duplication rates were notable: sample_X_R1 had a duplication rate of 58.0% and sample_X_R2 had 56.8%.
My current objective is to remove these duplicated reads from my dataset. To accomplish this, I used clumpify.sh from the BBTools suite. Here's an example of how I applied it:
clumpify.sh \
in=sample_X_R1.fastq.gz \
in2=sample_X_R2.fastq.gz \
out=sample_X_dedup_R1.fastq.gz \
out2=sample_X_dedup_R2.fastq.gz \
dedupe=t
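To quantify how many reads deduplication actually removed (just a quick sanity check; the file names follow the example above), the record counts before and after can be compared, since each FASTQ record spans four lines:

# number of reads = number of lines / 4
echo "R1 before: $(( $(zcat sample_X_R1.fastq.gz | wc -l) / 4 ))"
echo "R1 after:  $(( $(zcat sample_X_dedup_R1.fastq.gz | wc -l) / 4 ))"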
After deduplication, I proceeded to assess the quality of my reads using FastQC. Interestingly, FastQC revealed varying levels of duplication post-deduplication: sample_X_dedup_R1 showed a duplication rate of 28.0%, whereas sample_X_dedup_R2 exhibited 44.9%.
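For reference, the post-deduplication QC can be run on both mates in a single FastQC call; this is only a sketch, and the output directory name is arbitrary:

mkdir -p fastqc_dedup
fastqc -o fastqc_dedup sample_X_dedup_R1.fastq.gz sample_X_dedup_R2.fastq.gz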
I understand that clumpify.sh removes duplicates only when both forward and reverse reads match. However, I'm curious why the forward reads consistently show a larger reduction in duplication than the reverse reads. Any insights on this observation would be greatly appreciated.
Don't know if there is going to be a logical explanation here. It may just be a characteristic of your data (perhaps the tagmentation site at the 5' end has some sequence bias).
As you noted, the fragments you ended up with should be unique (at least as far as the parts that were sequenced).
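If you want to see what is driving the difference between the two mates, one rough check (assuming gzipped FASTQ, and using exact sequence matches as a crude stand-in for FastQC's duplication estimate) is to tally identical sequences within each mate file separately:

# most frequent exact sequences remaining in each mate after deduplication
zcat sample_X_dedup_R1.fastq.gz | awk 'NR%4==2' | sort | uniq -c | sort -rn | head
zcat sample_X_dedup_R2.fastq.gz | awk 'NR%4==2' | sort | uniq -c | sort -rn | head

Keep in mind that FastQC's duplication figure is estimated from a subset of reads, so it will not exactly match a full per-file count.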
The BBMap suite has a tool called tadpole.sh which can extend/assemble reads such as these. You may be able to build larger representations of these genomic segments. Perhaps not something you are interested in, though. Guide here: https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/tadpole-guide/
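For example, something along these lines (file names follow the example above; the mode, extension lengths, and k-mer size are placeholder values, so check the guide for the flags supported by your BBTools version):

tadpole.sh \
in=sample_X_dedup_R1.fastq.gz \
in2=sample_X_dedup_R2.fastq.gz \
out=sample_X_extended_R1.fastq.gz \
out2=sample_X_extended_R2.fastq.gz \
mode=extend el=50 er=50 k=62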