Question

How to understand differences in duplication rates between picard MarkDuplicates and fastp

8 months ago

Assa Yeroslaviz ★ 1.9k

Hi, I have tried to search for duplications in my data using both fastp and Picard. Unfortunately, the difference between the two tools is really big. For example, for one of the samples fastp shows 0.9%, while Picard identifies more than 32% of the reads as duplicated.

Both commands used default parameters. The data is plate-based single-cell data with UMIs. I could find out how fastp decides that a read is duplicated, which is quite relaxed. But I don't understand how Picard decides whether a read is an optical duplicate or a PCR duplicate.

I would appreciate it if someone could explain the difference more clearly, or point me to where I can find it.

thanks

duplication-rate rna-seq fastp picard • 1.4k views

I have tried to search for duplications in my data

and

Unfortunately the difference between the two tools is really big.

If you have not tried it, give clumpify.sh from the BBMap suite a try. It allows alignment-free marking of duplicates and has many options to handle them. More info: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates.

8 months ago by GenoMax 151k

No, I haven't tried this tool yet. I will give it a go. Thanks.

8 months ago by Assa Yeroslaviz ★ 1.9k

score 2 · Accepted Answer · 2024-08-28

The significant difference in duplication rates reported by Picard's MarkDuplicates and fastp is likely due to the distinct methodologies and definitions each tool uses to identify duplicated reads. Here's a breakdown of the main reasons behind the discrepancy:

1. Methodology and Stringency

Picard MarkDuplicates:
- Stringency: Picard works on aligned reads. It marks reads (or read pairs) as duplicates when their 5' alignment positions and orientations match; the read sequences themselves are not compared. This makes it aggressive on data where many distinct molecules start at the same position.
- Types of Duplicates: Picard differentiates between optical duplicates (arising during imaging in the sequencing process) and PCR duplicates (arising during the amplification process). Optical duplicates are the subset of position-based duplicates whose read names place them on the same tile within OPTICAL_DUPLICATE_PIXEL_DISTANCE pixels of each other (default 100); the remaining duplicates are treated as PCR duplicates. Unless you tell it otherwise, it doesn't treat UMIs specially, so it will count reads at the same position as PCR duplicates even when their UMIs differ.
- UMI Handling: If you're using UMIs, Picard requires additional configuration (e.g., the BARCODE_TAG option of MarkDuplicates) or a dedicated tool (e.g., Picard UmiAwareMarkDuplicatesWithMateCigar) to account for UMIs when marking duplicates.
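
To make the optical versus PCR distinction more concrete, here is a minimal Python sketch of the proximity test. It assumes Illumina-style read names whose last four fields are lane:tile:x:y and uses Picard's documented default OPTICAL_DUPLICATE_PIXEL_DISTANCE of 100. The helper names and read names are made up for illustration, and this is a simplification rather than Picard's actual implementation.

```python
# Hypothetical helpers, not part of Picard.
def parse_location(read_name):
    """Return (lane, tile, x, y) from a name like inst:run:flowcell:lane:tile:x:y."""
    lane, tile, x, y = read_name.split(":")[-4:]
    return lane, tile, int(x), int(y)


def is_optical_pair(name_a, name_b, pixel_distance=100):
    """True if two reads that are already position-based duplicates sit close
    together on the same tile, which is what would flag them as optical."""
    lane_a, tile_a, xa, ya = parse_location(name_a)
    lane_b, tile_b, xb, yb = parse_location(name_b)
    if (lane_a, tile_a) != (lane_b, tile_b):
        return False
    return abs(xa - xb) <= pixel_distance and abs(ya - yb) <= pixel_distance


# Made-up read names for three reads assumed to map to the same position.
read_a = "M01:55:FC1:1:1101:1000:2000"
read_b = "M01:55:FC1:1:1101:1040:2030"  # same tile, 40 and 30 pixels away
read_c = "M01:55:FC1:1:2205:1040:2030"  # different tile

print(is_optical_pair(read_a, read_b))  # True
print(is_optical_pair(read_a, read_c))  # False
```

In this sketch a duplicate pair that fails the proximity test would simply fall into the PCR category, which matches the description above.
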
fastp:
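- Stringency: fastp works on raw, unaligned reads, so it cannot consider mapping position. It uses a lightweight algorithm based on sequence identity. According to the fastp documentation, a read is counted as duplicated only if all of its bases are identical to those of another read, so a single sequencing error or N base prevents a match; older versions may have estimated duplication from only part of each read. In practice this flags far fewer reads than a position-based criterion.
- UMI Awareness: fastp has options for UMI preprocessing, but its duplication estimate is purely sequence-based. If the UMI is still part of the read when fastp evaluates duplication, reads that differ only in their UMI are counted as distinct, which may contribute to a much lower duplication estimate.
- Exact Sequence Matching: Because it relies on sequence matches rather than mapping positions, it will report fewer duplicates than Picard whenever reads from the same position differ in even one base. The gap can be especially large in single-cell data, where many reads come from the same few positions.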
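
The gap between a sequence-based and a position-based criterion can be illustrated with a toy Python example. It is only a sketch under stated assumptions: the reads are invented, `duplicate_rate` is a hypothetical helper, and neither tool computes its reported rate exactly this way.

```python
from collections import Counter


def duplicate_rate(keys):
    """Fraction of items that repeat an already-seen key."""
    keys = list(keys)
    counts = Counter(keys)
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(keys)


# Invented single-end reads: (sequence, chromosome, 5' start, strand).
reads = [
    ("ACGTACGTAA", "chr1", 100, "+"),
    ("ACGTACGTAA", "chr1", 100, "+"),  # exact copy of the first read
    ("ACGTACGTAT", "chr1", 100, "+"),  # same position, last base differs
    ("ACGTACGAAA", "chr1", 100, "+"),  # same position, another mismatch
    ("TTGACCGTAG", "chr2", 500, "-"),
]

# Sequence-identity criterion (fastp-like): only exact copies count.
seq_rate = duplicate_rate(r[0] for r in reads)
# Position criterion (Picard-like): same chromosome, 5' start and strand.
pos_rate = duplicate_rate(r[1:] for r in reads)

print(seq_rate)  # 0.2
print(pos_rate)  # 0.6
```

The same five reads give 20% under one definition and 60% under the other, so a large difference between the tools does not by itself indicate that either one is malfunctioning.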

2. Plate-based Single-Cell Data with UMI

UMIs (Unique Molecular Identifiers): In single-cell sequencing, UMIs are crucial for distinguishing reads that come from distinct original molecules from technical (PCR) copies of the same molecule. If Picard is not configured to recognize UMIs, it will count reads with the same mapping position as duplicates even if they have different UMIs, leading to a higher duplication rate. In RNA-seq this effect is amplified, because highly expressed genes naturally produce many independent fragments that start at the same position.
Read Length and Complexity: If your reads are short or have low complexity, fastp might miss some duplicates that Picard would catch, because fastp compares read sequences while Picard compares mapping positions.
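
As a rough illustration of why UMI awareness matters, the Python sketch below counts duplicates among invented reads that share a mapping position, first ignoring the UMI and then including it in the grouping key. The data and the `count_duplicates` helper are hypothetical, and real UMI-aware tools typically also tolerate sequencing errors within the UMI, which this sketch does not.

```python
from collections import Counter


def count_duplicates(keys):
    """Number of items beyond the first occurrence of each key."""
    return sum(n - 1 for n in Counter(keys).values())


# Invented reads: (chromosome, 5' start, strand, UMI).
umi_reads = [
    ("chr1", 100, "+", "AACG"),
    ("chr1", 100, "+", "AACG"),  # same position and same UMI: PCR copy
    ("chr1", 100, "+", "TTGC"),  # same position, different molecule
    ("chr1", 100, "+", "GGAT"),  # same position, different molecule
    ("chr2", 500, "-", "AACG"),
]

# Position only (no UMI awareness).
position_only = count_duplicates(r[:3] for r in umi_reads)
# Position plus UMI.
position_and_umi = count_duplicates(umi_reads)

print(position_only / len(umi_reads))     # 0.6
print(position_and_umi / len(umi_reads))  # 0.2
```

Adding the UMI to the key drops the duplicate count from three reads to one here, which is the direction of change you would expect when moving from a default position-based run to a UMI-aware one.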

3. Recommendations

UMI Handling in Picard: If you're using UMIs, you should ensure that Picard is configured to recognize them properly. You can use tools like Picard UmiAwareMarkDuplicatesWithMateCigar, which is specifically designed for handling UMIs and distinguishing true duplicates from distinct molecules.
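Customization in fastp: If using fastp, you can raise the accuracy of its duplication estimate (the --dup_calc_accuracy option, at the cost of more memory), but the criterion remains sequence identity on unaligned reads, so the result should not be expected to match Picard's position-based rate.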

4. Conclusion

The difference in reported duplication rates is primarily due to how each tool handles and defines duplicate reads, especially in the context of UMI-based single-cell sequencing. Picard, by default, is more likely to overestimate duplicates if not UMI-aware, because it groups reads by mapping position alone, while fastp is likely to report far fewer because it only counts reads whose sequences match.