To remove or to keep duplicates in alignment of NGS paired reads to a draft genome (set of assembled contigs)
10.0 years ago
misaghb ▴ 20

I have a SAM file containing alignments of NGS paired reads against a set of assembled contigs (de novo).

If I later need to infer coverage information about these contigs from the SAM file (e.g. whether a contig is unique or a collapsed repeat with some copy number (repeat count) in the genome), should I mark and remove duplicates, either from the SAM file (using Picard MarkDuplicates or SAMtools rmdup) or from the read sequences (FastUniq)? Or should I keep everything as it is, so as not to lose anything important that might affect downstream analyses?

Would you please share your opinion and let me know the pros and cons of duplicate removal in this case?

Thanks.

sam bam duplicates • 4.8k views

If you need to assess sequencing and library-construction biases, as well as assembly quality, you will probably need that data. Removing it would likely speed up downstream analyses (less data to crunch, less space), but it takes away the ability to assess the true variance. In addition, reads that are identical in the alignment sense (i.e. duplicates) do not necessarily represent the same DNA stretch; you need to account for that when making a decision.


I am concerned with just two downstream analyses: (1) scaffolding the contigs, and (2) estimating the integer copy number (repeat count) of each contig. I suspect that removing duplicates from the SAM file might have a good effect (mainly a speed-up) on (1) and a very bad effect on (2). Am I right?
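One way to check how sensitive (2) is to duplicate removal is to mark duplicates rather than delete them, then compute the per-contig depth twice: once counting every primary alignment and once skipping records with the duplicate FLAG bit (0x400) set. The following is only a rough sketch under several assumptions: the SAM file has `@SQ` header lines, duplicates have already been marked by some tool, depth is approximated as the sum of read lengths divided by contig length (soft-clipped bases included), and the copy number is guessed as depth relative to the median contig depth. The function names `contig_depths` and `copy_numbers` are made up for illustration and are not part of any library.

```python
from collections import defaultdict
from statistics import median

# FLAG bits as defined in the SAM specification
UNMAPPED, SECONDARY, DUPLICATE, SUPPLEMENTARY = 0x4, 0x100, 0x400, 0x800


def contig_depths(sam_lines, skip_duplicates=False):
    """Approximate mean depth per contig: summed read lengths / contig length."""
    lengths = {}
    bases = defaultdict(int)
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if line.startswith("@"):
            # Contig names and lengths come from the @SQ header lines
            if fields[0] == "@SQ":
                tags = dict(f.split(":", 1) for f in fields[1:])
                lengths[tags["SN"]] = int(tags["LN"])
            continue
        flag, rname, seq = int(fields[1]), fields[2], fields[9]
        # Count primary, mapped alignments only
        if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
            continue
        if skip_duplicates and flag & DUPLICATE:
            continue
        if seq == "*":  # sequence not stored, so its length is unknown here
            continue
        bases[rname] += len(seq)
    return {name: bases[name] / length for name, length in lengths.items()}


def copy_numbers(depths):
    """Illustrative estimator: depth relative to the median non-zero contig depth.

    Assumes most contigs are single-copy, so the median approximates 1x.
    """
    baseline = median(d for d in depths.values() if d > 0)
    return {name: round(d / baseline) for name, d in depths.items()}
```

Passing an open SAM file handle to `contig_depths` with `skip_duplicates=False` and then with `skip_duplicates=True`, and comparing the two `copy_numbers` results, would show which contigs change their estimated repeat count when duplicates are dropped. Contigs whose estimate changes are the ones where the decision matters.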


DNA or RNA?


DNA sequences
