Hi,
I am trying to assemble a SARS-CoV-2 genome, but I am stuck at the preprocessing of the sequencing reads. When I analyse the original FASTQ files with FastQC, I see a high number of duplicate reads. If this were RNA-Seq I wouldn't mind the duplicates, but this is for a de novo assembly, so I would like to remove them so that they don't interfere with the quality of the assembly. I have tried several tools (fastp, clumpify.sh, Picard MarkDuplicates, ...) and I can't get rid of the duplicates. With fastp it doesn't work. With clumpify.sh the duplicates are removed from only one of the paired FASTQ files and not from the other. With MarkDuplicates, once I run it and convert the output back to FASTQ to analyse with FastQC, I again get a high number of duplicate reads. Does anyone know of an alternative way to remove these duplicates, or could you point me to information or tutorials so I can check whether I'm doing it right?
Show us the command lines you ran, and show us some examples of the duplicated reads.
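If it helps to pull out examples, here is a minimal sketch in Python that lists the most frequent exact duplicate read pairs. It is only an illustration under some assumptions: both files are plain or gzipped FASTQ with standard 4-line records, the mates appear in the same order in R1 and R2, and a "duplicate" means an identical (R1, R2) sequence pair. The function names are made up for this example, and it keeps every distinct pair in memory, so it may be too heavy for very large files.

```python
import gzip
from collections import Counter
from itertools import islice


def read_sequences(path):
    """Yield the sequence line of each 4-line FASTQ record."""
    # Assumption: a ".gz" suffix means gzip-compressed; anything else is plain text.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        # The sequence is the 2nd line of every 4-line record.
        for line in islice(handle, 1, None, 4):
            yield line.strip()


def top_duplicate_pairs(r1_path, r2_path, n=5):
    """Return up to n (pair, count) tuples for pairs seen more than once."""
    # Pair up mates by position; this assumes R1 and R2 are in the same order.
    counts = Counter(zip(read_sequences(r1_path), read_sequences(r2_path)))
    # most_common() sorts by count, highest first; keep only true duplicates.
    return [(pair, c) for pair, c in counts.most_common() if c > 1][:n]
```

Calling `top_duplicate_pairs("sample_R1.fastq.gz", "sample_R2.fastq.gz")` (hypothetical file names) returns the sequence pairs together with how often each occurs. This only catches exact matches, so pairs that differ by a sequencing error will not be counted together.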