I have a SAM file containing alignments of NGS paired-end reads against a set of de novo assembled contigs.
I need to infer coverage information for these contigs from the SAM file (e.g. to tell whether a contig is unique or a collapsed repeat with some copy number / repeat count in the genome). Should I mark and remove duplicates, either from the SAM file (using Picard MarkDuplicates or samtools rmdup) or from the read sequences themselves (FastUniq), or should I keep everything as it is so that I don't lose anything important that might affect downstream analyses?
Would you please share your opinion and let me know the pros and cons of duplicate removal in this case?
Thanks.
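For concreteness, here is a minimal sketch of what I mean by inferring coverage (plain Python, no external libraries; the script and its names are my own illustration, not from any particular tool). It assumes duplicates have already been *marked* with the 0x400 FLAG bit (e.g. by Picard MarkDuplicates without removing them), so each contig's mean coverage can be computed with and without duplicates and compared:

```python
import re
import sys
from collections import defaultdict

# Per-contig mean coverage from a SAM file, computed twice: once over all
# primary alignments and once skipping reads flagged as duplicates (0x400).

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def aligned_ref_bases(cigar):
    """Number of reference bases covered by the alignment (M, =, X operations)."""
    return sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in "M=X")

def per_contig_coverage(sam_path):
    contig_len = {}                      # contig lengths from @SQ header lines
    bases_all = defaultdict(int)         # aligned bases, duplicates included
    bases_dedup = defaultdict(int)       # aligned bases, duplicates excluded

    with open(sam_path) as fh:
        for line in fh:
            if line.startswith("@"):
                if line.startswith("@SQ"):
                    fields = dict(f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:])
                    contig_len[fields["SN"]] = int(fields["LN"])
                continue
            qname, flag, rname, pos, mapq, cigar = line.split("\t")[:6]
            flag = int(flag)
            # skip unmapped (0x4), secondary (0x100) and supplementary (0x800) records
            if flag & 0x4 or flag & 0x100 or flag & 0x800 or rname == "*":
                continue
            n = aligned_ref_bases(cigar)
            bases_all[rname] += n
            if not flag & 0x400:         # 0x400 = PCR/optical duplicate
                bases_dedup[rname] += n

    for contig, length in contig_len.items():
        yield contig, bases_all[contig] / length, bases_dedup[contig] / length

if __name__ == "__main__":
    print("contig\tcov_with_dups\tcov_without_dups")
    for contig, cov_all, cov_dedup in per_contig_coverage(sys.argv[1]):
        print(f"{contig}\t{cov_all:.2f}\t{cov_dedup:.2f}")
```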
If you need to assess sequencing and library-construction biases, as well as assembly quality, you will probably want to keep that data. Removing duplicates, however, is likely to speed up downstream analyses (less data to crunch, less storage), while taking away the ability to assess the true variance. In addition, identical reads (i.e. duplicates in the alignment sense) do not necessarily represent the same stretch of DNA; you need to account for that when making a decision.
I am concerned with just two downstream analyses: (1) scaffolding the contigs, and (2) estimating the integer copy number (repeat count) of each contig. I suspect that removing duplicates from the SAM file might have a good effect (mainly a speed-up) on (1) and a very bad effect on (2). Am I right?
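Just to make (2) concrete, this is roughly the heuristic I have in mind (my own sketch, not from any particular tool): take some baseline coverage as the single-copy level and estimate each contig's copy number as its coverage divided by that baseline, rounded to the nearest integer. This is exactly why I worry that stripping duplicates would distort the coverage values it relies on.

```python
import statistics

def estimate_copy_numbers(coverages, baseline=None):
    """coverages: dict of contig -> mean coverage (e.g. from the sketch above).
    baseline: expected single-copy (1x) coverage; if None, the median
    per-contig coverage is used as a crude proxy for the 1x level."""
    if baseline is None:
        baseline = statistics.median(coverages.values())
    return {c: max(1, round(cov / baseline)) for c, cov in coverages.items()}

# Toy example: with a ~30x single-copy baseline, a ~60x contig looks like a
# collapsed two-copy repeat, while the other contigs look single-copy.
cov = {"contig_1": 29.5, "contig_2": 31.2, "contig_3": 61.0, "contig_4": 28.8}
print(estimate_copy_numbers(cov))
# {'contig_1': 1, 'contig_2': 1, 'contig_3': 2, 'contig_4': 1}
```

A length-weighted baseline or a GC-bias correction would obviously be better; this is just to show where the coverage values feed in.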
DNA or RNA?
DNA sequences