Hi, I am the author of fastp, a tool to provide ultra-fast all-in-one FASTQ preprocessing functions.
This tool has received 500+ stars on GitHub (https://github.com/OpenGene/fastp), and has been cited 40+ times since its paper was published in Bioinformatics about 8 months ago.
Now I am considering adding a deduplication function to it. Implementing this may take some effort, so I would like to ask the users here whether people need this feature.
Your replies will be much appreciated. I will continue to improve this tool.
chen : Can I make an unrelated suggestion?
If you are looking for a new programming challenge, then consider creating a data simulator that can generate data with UMIs. Think about creating data for single cells, cell types, 10X, etc. AFAIK there is nothing available that can do this now.
I concur with @Devon's point below but the nature of the data necessitates use of extreme amounts of RAM (I have used over TB for NovaSeq data with clumpify).
Thanks, I will consider your suggestion.
For deduplication, I think I can keep the RAM usage below 16 GB, even when processing 1 Tbp of Illumina PE data.
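For readers wondering how FASTQ-level deduplication can keep memory in check, here is a minimal sketch of one common idea: store a small fixed-size digest of each read pair instead of the sequences themselves. This is not fastp's implementation (the thread does not describe one); the function names `pair_key` and `dedup_pairs` and the 8-byte digest size are assumptions made for illustration only.

```python
import hashlib


def pair_key(seq1: str, seq2: str) -> bytes:
    """Return a fixed-size 8-byte digest for a read pair (hypothetical helper)."""
    h = hashlib.blake2b(digest_size=8)
    h.update(seq1.encode("ascii"))
    h.update(b"\x00")  # separator so ("AC", "G") and ("A", "CG") hash differently
    h.update(seq2.encode("ascii"))
    return h.digest()


def dedup_pairs(pairs):
    """Yield the first occurrence of each (seq1, seq2) pair, skipping repeats."""
    seen = set()  # holds only the 8-byte digests, never the full sequences
    for seq1, seq2 in pairs:
        key = pair_key(seq1, seq2)
        if key not in seen:
            seen.add(key)
            yield seq1, seq2


# Toy input: the second pair is an exact duplicate of the first.
reads = [
    ("ACGT", "TTGA"),
    ("ACGT", "TTGA"),
    ("ACGT", "TTGC"),
    ("GGCC", "TTGA"),
]
unique = list(dedup_pairs(reads))
print(len(reads) - len(unique), "duplicate pair(s) removed")
```

The trade-off in this sketch is that two different pairs could in principle share a digest and one of them would be dropped; a real tool would have to decide how much of that risk is acceptable for the memory saved.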
After 7 months, how is the landscape? Is there a tool for extracting UMIs and deduplicating at the FASTQ level? In my workflow I need to deduplicate before mapping and generating the BAM.
They've recently added a gencore repository, which might be able to do that. I haven't used this yet, I just remember merging in the bioconda recipes recently.
Update: I guess this takes BAM files, so it's not relevant.
manekineko : If you need de-duping before mapping, your best bet is still Clumpify from the BBMap suite (see "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files").
Hi, can you tell me whether fastp actually removes duplicates or just counts them?
Cheers.