Hey there,
I'm curious about the deduplication step in processing sequencing reads. I've done it a handful of times so far and it has always helped in the end, but I'm aware there is a debate about whether it is actually biologically correct to do.
What I usually do is map the reads, get the BAM file, and run Picard MarkDuplicates on it.
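For reference, here is a minimal sketch of that workflow (file names, thread counts, and the choice of bwa mem are placeholders, not a prescription):

```bash
# Align, coordinate-sort, then mark duplicates by mapping position.
bwa mem -t 8 ref.fa reads_R1.fastq.gz reads_R2.fastq.gz \
    | samtools sort -@ 8 -o aligned.sorted.bam -
samtools index aligned.sorted.bam

# MarkDuplicates flags duplicates by default; add REMOVE_DUPLICATES=true
# to drop them from the output instead of just marking them.
picard MarkDuplicates \
    I=aligned.sorted.bam \
    O=marked.bam \
    M=dup_metrics.txt
```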
I have three questions:
- How do people deduplicate by mapping position using a PSL file?
- When would you say that deduplication is too risky?
- I developed my own tool (though some already exist) to remove duplicates by sequence identity. Without going into the details of the algorithm, I can tell you that the intersection of the reads removed by Picard and by my script is 99% (not 100%, though; some reads differ). Is this approach theoretically correct? A sketch of the general idea follows this list.
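To illustrate deduplicating by sequence identity rather than by mapping position (this is not the tool mentioned above, just a minimal sketch that handles exact duplicates in a single-end FASTQ; a real tool would also handle mismatches and read pairs):

```bash
# Collapse each 4-line FASTQ record onto one tab-separated line,
# keep one record per unique sequence (field 2), then restore the
# 4-line layout. Output order follows the sort, not the input.
paste - - - - < reads.fastq \
    | sort -t $'\t' -k2,2 -u \
    | tr '\t' '\n' > dedup.fastq
```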
To add to this:
I recommend optical duplicate removal for all HiSeq platforms, for any kind of project in which you expect high library complexity (such as WGS). By optical duplicate, I mean removal of duplicates with very close coordinates on the flow cell. And by duplicate removal, I mean removing all duplicate copies except one. Whether you should remove non-positionally-correlated duplicates (such as PCR duplicates) is more experiment-specific. And whether you should do any form of duplicate removal on low-complexity libraries is also experiment-specific, as you'll get false positives even when restricting duplicate detection to nearby clusters.
For background, see http://core-genomics.blogspot.com/2016/05/increased-read-duplication-on-patterned.html and this recent thread: Duplicates on Illumina.
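As a concrete example with Picard (the values below are commonly suggested ones, not universal settings): OPTICAL_DUPLICATE_PIXEL_DISTANCE controls how close two clusters must be to count as optical duplicates, and REMOVE_SEQUENCING_DUPLICATES removes only those, leaving likely PCR duplicates merely marked:

```bash
# Remove optical (sequencing) duplicates only; PCR duplicates stay marked.
# The default pixel distance is 100; ~2500 is often suggested for
# patterned flow cells (HiSeq 3000/4000/X, NovaSeq).
picard MarkDuplicates \
    I=aligned.sorted.bam \
    O=optical_removed.bam \
    M=dup_metrics.txt \
    OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 \
    REMOVE_SEQUENCING_DUPLICATES=true
```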
I meant to say alignment to an external reference (corrected above). The mark-duplicates functionality is an extension of the clumpify algorithm (details are in this thread: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates.), which identifies reads with similar sequences in a file and rearranges them to be near each other; this lets the data files compress more efficiently, saving roughly 25% of the space. Optical duplicates are marked by taking into account the x,y coordinates of the read clusters and their positional neighborhood on the flow cell.
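For example (a sketch only; the dupedist value is platform-dependent, so check the thread linked above for suggested settings):

```bash
# Remove optical duplicates from paired-end reads with Clumpify.
# dedupe enables duplicate removal; optical restricts it to nearby
# clusters; dupedist is the maximum cluster distance (e.g. ~40 for
# HiSeq 2500, ~2500 for HiSeq 3000/4000, ~12000 for NovaSeq).
clumpify.sh in=reads_R1.fastq.gz in2=reads_R2.fastq.gz \
    out=clumped_R1.fastq.gz out2=clumped_R2.fastq.gz \
    dedupe optical dupedist=2500
```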
Thank you very much!