Unable to remove duplicate sequences
3 months ago

Hi,

I am trying to make a SARS-CoV-2 assembly, but I am stuck at the preprocessing of the sequencing reads. When I analyse the original FASTQ files with FastQC, I see a high number of duplicate reads. If this were RNA-Seq I wouldn't mind these duplicates, but the goal is a de novo assembly, so I would like to remove them so that they don't interfere with the quality of the assembly.

I have tried several methods to remove these duplicates, using fastp, clumpify.sh, Picard MarkDuplicates, ... and I can't get rid of the reads. With fastp, it doesn't work. With clumpify.sh, duplicates are removed from only one of the paired FASTQ files and not from the other. With MarkDuplicates, once I have run it and converted the output back to FASTQ to analyse with FastQC, I again get a high number of duplicate reads.

Does anyone know of an alternative way to remove these duplicates, or could you point me to information or tutorials so I can check whether I'm doing it right?


I have tried several methods to remove these duplicates, using fastp, clumpify.sh, Picard MarkDuplicates, ... and I can't get rid of the reads.

Show us the command lines, and show us some examples of duplicated reads.

3 months ago
michael.ante ★ 3.9k

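FastQC uses only the first 50 bp of each read to compute the duplication rate (reads longer than 75 bp are truncated to 50 bp for this module; see the FastQC documentation on duplicate sequences), whilst the tools you mention compare the whole read.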

If two reads are identical over the first 50 bp but differ from position 51 onwards, FastQC counts them as duplicates, whereas the tools at hand see two distinct reads and remove neither.
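To make the difference concrete, here is a minimal sketch in Python. It is not FastQC's actual code: the function name `duplication_rate` and the toy reads are made up for illustration, and the 75 bp / 50 bp thresholds are taken from the FastQC documentation as I understand it. Other details of the module, such as only tracking sequences seen in the first part of the file, are ignored here.

```python
from collections import Counter


def duplication_rate(reads, truncate=False, over=75, keep=50):
    """Return the fraction of reads that are copies of another read.

    With truncate=True, reads longer than `over` bases are cut to their
    first `keep` bases before comparison (an approximation of how FastQC
    is documented to behave; hypothetical helper, not a FastQC API).
    """
    keys = []
    for read in reads:
        if truncate and len(read) > over:
            read = read[:keep]
        keys.append(read)
    counts = Counter(keys)
    # Every occurrence beyond the first one counts as a duplicate.
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(keys) if keys else 0.0


# Toy 100 bp reads: read_a and read_b share their first 50 bases
# and differ from position 51 onwards; read_c is unrelated.
prefix = ("ACGTTGCA" * 7)[:50]
read_a = prefix + "A" * 50
read_b = prefix + "C" * 50
read_c = "G" * 100
reads = [read_a, read_b, read_c]

whole_read_rate = duplication_rate(reads)                 # whole-read comparison
truncated_rate = duplication_rate(reads, truncate=True)   # first 50 bp only

print(f"whole-read duplication: {whole_read_rate:.2f}")
print(f"50 bp-prefix duplication: {truncated_rate:.2f}")
```

In this toy set, the whole-read comparison finds no duplicates, while the prefix-based comparison reports one of the three reads as a duplicate. This is why a FastQC report can still show a high duplication level after exact-match deduplication.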
