7.1 years ago
blur
Hi, I want to use Picard's MarkDuplicates tool, but after reading the manual I am still not sure I understand the method it uses. http://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates It reads: "The MarkDuplicates tool works by comparing sequences in the 5 prime positions of both reads and read-pairs in a SAM/BAM file"
Does this mean duplicates are marked based on their chr+start position and the 5'-sequence? Or does the tool take the full sequence into account by using the CIGAR data?
Thanks in advance.
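To make the question concrete, here is a minimal sketch of how the CIGAR could enter a purely position-based comparison. It assumes duplicates are keyed on chromosome, strand, and the unclipped 5' coordinate (the aligned start or end, extended by any soft/hard clipping) rather than on the read sequence itself. The function names and the key layout are hypothetical and are not taken from Picard's code.

```python
import re

# Splits a CIGAR string such as "5S45M" into (length, operation) pairs.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def unclipped_five_prime(pos, cigar, reverse):
    """Return the 1-based 5' coordinate of a read as if no bases were clipped.

    pos     -- SAM POS field (1-based leftmost aligned base)
    cigar   -- SAM CIGAR string
    reverse -- True if the read maps to the reverse strand
    """
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]

    if not reverse:
        # Forward strand: the 5' end is on the left, so subtract leading clips.
        leading = 0
        for n, op in ops:
            if op not in "SH":
                break
            leading += n
        return pos - leading

    # Reverse strand: the 5' end is on the right, so take the alignment end
    # (operations that consume the reference) and add trailing clips.
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    trailing = 0
    for n, op in reversed(ops):
        if op not in "SH":
            break
        trailing += n
    return pos + ref_len - 1 + trailing


def duplicate_key(chrom, pos, cigar, reverse):
    """Hypothetical grouping key: reads sharing it would be duplicate candidates."""
    return (chrom, unclipped_five_prime(pos, cigar, reverse), reverse)


# Two forward reads that differ only by a 5 bp soft clip get the same key.
key_a = duplicate_key("chr1", 100, "5S45M", False)
key_b = duplicate_key("chr1", 95, "50M", False)
```

Under this assumption the CIGAR is used only to correct the coordinate for clipping; the bases themselves are never compared.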
Keeping in mind @ATPoint's note, if you do want to remove PCR/optical duplicates for other reasons, then use Clumpify (see "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files"). It does not need the data to be aligned and works directly from the sequences.
Will the answer to this question influence your decision to use it or not in any way?
Yes. Duplicate removal had influenced my results dramatically in the past.
Hope you do not want to remove duplicates from RNA-seq data, as the tags of your post suggest?
That is exactly why this operation is so dangerous. You had better be sure that the removed duplicates are all artificial and not a natural effect of the high coverage.
There is a common myth floating around that "duplicates" is a synonym for "error". That is a remnant of the past, when coverages were typically low.
I don't doubt you, but do you have a source for this? I am new to RNA-seq and what I have read is in line with the "myth" you're referring to. I would like to know more about whether or not I should be removing duplicates.
You can google for papers (mostly newer ones) that used Unique Molecular Identifiers (UMIs) to investigate how many of the observed duplicates are actually due to PCR redundancy and how many are due to coverage. The current consensus, from what I know, is that in targeted assays one generally does not remove duplicates, as it would remove too many non-technical duplicates. In fact, I know of no pipeline that would remove RNA-seq duplicates. Here it is generally well accepted to go with the reads/counts as observed rather than deduplicating the experiment.
See https://dnatech.genomecenter.ucdavis.edu/faqs/should-i-remove-pcr-duplicates-from-my-rna-seq-data/ for references
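The UMI argument can be illustrated with a small sketch: count how many reads would be flagged as duplicates by position alone, and how many remain duplicates once the UMI is added to the key. The read tuples and the function name are made up for illustration, and real UMI tools also handle sequencing errors in the UMI, which this sketch ignores.

```python
from collections import Counter


def count_duplicates(reads):
    """Count redundant reads under two definitions of "duplicate".

    reads -- list of (chrom, five_prime_pos, strand, umi) tuples
    Returns (position_dups, position_umi_dups).
    """
    n = len(reads)
    # Position-only key: what a coordinate-based deduplicator sees.
    by_position = Counter((c, p, s) for c, p, s, _ in reads)
    # Position + UMI key: identical UMIs at one position suggest PCR copies.
    by_position_umi = Counter(reads)
    return n - len(by_position), n - len(by_position_umi)


# Hypothetical example: four reads stack at chr1:100, but only two share a UMI.
example_reads = [
    ("chr1", 100, "+", "AAC"),
    ("chr1", 100, "+", "AAC"),
    ("chr1", 100, "+", "GTT"),
    ("chr1", 100, "+", "CCA"),
    ("chr2", 500, "-", "AAC"),
]
position_dups, umi_dups = count_duplicates(example_reads)
```

In this toy data, position-based deduplication would discard three reads, while the UMIs indicate that only one of them is a PCR copy; the other two come from distinct molecules that happen to start at the same coordinate.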