I have a SAM file containing alignments of NGS paired-end reads against a set of de novo assembled contigs.
I need to infer coverage information for these contigs from the SAM file (e.g. to tell whether a contig is unique or a collapsed repeat with some copy number / repeat count in the genome). Should I mark and remove duplicates, either from the SAM file (using Picard MarkDuplicates or samtools rmdup) or from the read sequences themselves (FastUniq), or should I keep everything as it is so that I don't lose anything important that might affect downstream analyses?
Would you please share your opinion and let me know the pros and cons of duplicate removal in this case?
Thanks.
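For concreteness, here is a minimal sketch of what I mean by inferring coverage (plain Python, no external libraries; the script and its names are my own illustration, not from any particular tool). It assumes duplicates have already been *marked* with the 0x400 FLAG bit (e.g. by Picard MarkDuplicates without removing them), so each contig's mean coverage can be computed with and without duplicates and compared:

```python
import re
import sys
from collections import defaultdict

# Per-contig mean coverage from a SAM file, computed twice: once over all
# primary alignments and once skipping reads flagged as duplicates (0x400).

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def aligned_ref_bases(cigar):
    """Number of reference bases covered by the alignment (M, =, X operations)."""
    return sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in "M=X")

def per_contig_coverage(sam_path):
    contig_len = {}                      # contig lengths from @SQ header lines
    bases_all = defaultdict(int)         # aligned bases, duplicates included
    bases_dedup = defaultdict(int)       # aligned bases, duplicates excluded

    with open(sam_path) as fh:
        for line in fh:
            if line.startswith("@"):
                if line.startswith("@SQ"):
                    fields = dict(f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:])
                    contig_len[fields["SN"]] = int(fields["LN"])
                continue
            qname, flag, rname, pos, mapq, cigar = line.split("\t")[:6]
            flag = int(flag)
            # skip unmapped (0x4), secondary (0x100) and supplementary (0x800) records
            if flag & 0x4 or flag & 0x100 or flag & 0x800 or rname == "*":
                continue
            n = aligned_ref_bases(cigar)
            bases_all[rname] += n
            if not flag & 0x400:         # 0x400 = PCR/optical duplicate
                bases_dedup[rname] += n

    for contig, length in contig_len.items():
        yield contig, bases_all[contig] / length, bases_dedup[contig] / length

if __name__ == "__main__":
    print("contig\tcov_with_dups\tcov_without_dups")
    for contig, cov_all, cov_dedup in per_contig_coverage(sys.argv[1]):
        print(f"{contig}\t{cov_all:.2f}\t{cov_dedup:.2f}")
```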
If you need to assess sequencing and library-construction biases, as well as assembly quality, you will probably want to keep that data. Removing duplicates, however, is likely to speed up downstream analyses (less data to crunch, less storage), while taking away the ability to assess the true variance. In addition, identical reads (i.e. duplicates in the alignment sense) do not necessarily represent the same stretch of DNA; you need to account for that when making a decision.
I am concerned with just two downstream analyses: (1) scaffolding the contigs, and (2) estimating the integer copy number (repeat count) of each contig. I suspect that removing duplicates from the SAM file might have a good effect (mainly a speed-up) on (1) and a very bad effect on (2). Am I right?
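Just to make (2) concrete, this is roughly the heuristic I have in mind (my own sketch, not from any particular tool): take some baseline coverage as the single-copy level and estimate each contig's copy number as its coverage divided by that baseline, rounded to the nearest integer. This is exactly why I worry that stripping duplicates would distort the coverage values it relies on.

```python
import statistics

def estimate_copy_numbers(coverages, baseline=None):
    """coverages: dict of contig -> mean coverage (e.g. from the sketch above).
    baseline: expected single-copy (1x) coverage; if None, the median
    per-contig coverage is used as a crude proxy for the 1x level."""
    if baseline is None:
        baseline = statistics.median(coverages.values())
    return {c: max(1, round(cov / baseline)) for c, cov in coverages.items()}

# Toy example: with a ~30x single-copy baseline, a ~60x contig looks like a
# collapsed two-copy repeat, while the other contigs look single-copy.
cov = {"contig_1": 29.5, "contig_2": 31.2, "contig_3": 61.0, "contig_4": 28.8}
print(estimate_copy_numbers(cov))
# {'contig_1': 1, 'contig_2': 1, 'contig_3': 2, 'contig_4': 1}
```

A length-weighted baseline or a GC-bias correction would obviously be better; this is just to show where the coverage values feed in.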
DNA or RNA?
DNA sequences