Question

Unable to remove duplicate sequences

9 months ago

juanluis0516 • 0

Hi,

I am trying to assemble a SARS-CoV-2 genome, but I am stuck at the preprocessing of the sequencing reads. When I analyse the original FASTQ files with FastQC, I see a high number of duplicate reads. If this were RNA-Seq I wouldn't mind them, but this is for a de novo assembly, so I would like to remove them so that they don't interfere with the quality of the assembly.

I have tried several tools to remove these duplicates (fastp, clumpify.sh, Picard MarkDuplicates, ...) and I can't get rid of them. fastp does not remove them. clumpify.sh removes the duplicates from only one of the paired FASTQ files and leaves the other unchanged. With MarkDuplicates, once I convert the output back to FASTQ and analyse it with FastQC, I again see a high number of duplicate reads.

Does anyone know of an alternative way to remove these duplicates, or could you point me to information or tutorials so I can check whether I'm doing it right?

duplicate sequences Removing • 763 views

"I have tried several methods to remove these duplicates, using fastp, clumpify.sh, Picard MarkDuplicates, ... and I can't get rid of the reads."

Show us the command lines, and show us some examples of the duplicated reads.
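One way to pull out a few example duplicates to post is sketched below. It is only an illustration, not part of any of the tools mentioned: the function names `read_sequences` and `most_duplicated` are hypothetical, and it assumes a standard 4-line-per-record FASTQ file (no wrapped sequence lines), optionally gzip-compressed.

```python
import gzip
from collections import Counter


def read_sequences(path):
    """Yield the sequence line of each record in a 4-line FASTQ file."""
    # Assumption: plain or gzip-compressed FASTQ, exactly 4 lines per record.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for index, line in enumerate(handle):
            if index % 4 == 1:  # second line of each record is the sequence
                yield line.strip()


def most_duplicated(sequences, top=5):
    """Return up to `top` (sequence, count) pairs that occur more than once."""
    counts = Counter(sequences)
    return [(seq, n) for seq, n in counts.most_common(top) if n > 1]
```

For example, `most_duplicated(read_sequences("sample_R1.fastq.gz"))` (the file name is a placeholder) would list the most frequent exact-duplicate sequences in one file, which can then be pasted into the thread alongside the command lines.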

9 months ago by Pierre Lindenbaum 166k

score 0 · Answer 1 · 2024-08-21

9 months ago

michael.ante ★ 4.0k

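FastQC uses only the first 50 bp of each read to compute the duplication rate (for this module, reads longer than 75 bp are truncated to 50 bp; see the FastQC documentation on the Duplicate Sequences module), whilst the tools you mention compare the whole read.

If two reads are identical over their first 50 bp and differ only from position 51 onwards, FastQC will count them as duplicates, but the tools at hand will not remove either of them.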
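The difference can be illustrated with a small Python sketch. This is a simplified model of the two comparison strategies, not FastQC's or any deduplication tool's actual implementation; the function name `duplicate_count` and the example reads are made up for illustration.

```python
def duplicate_count(reads, prefix_len=None):
    """Count reads whose comparison key has already been seen.

    prefix_len=None compares whole reads (as the deduplication tools do);
    prefix_len=50 compares only the first 50 bases (as described above
    for FastQC).
    """
    seen = set()
    duplicates = 0
    for read in reads:
        key = read if prefix_len is None else read[:prefix_len]
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates


# Two hypothetical 80 bp reads: identical for 50 bp, different afterwards.
shared_prefix = "ACGTACGTAC" * 5
read_a = shared_prefix + "A" * 30
read_b = shared_prefix + "C" * 30

print(duplicate_count([read_a, read_b], prefix_len=50))  # prefix comparison
print(duplicate_count([read_a, read_b]))                 # whole-read comparison
```

Under the prefix comparison the second read counts as a duplicate of the first; under the whole-read comparison neither does, which is why the FastQC duplication level can stay high after deduplication.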
