Hello,
After searching in this user group and analyzing the output of picard MarkDuplicates, it seems that picard does not output which of the reads it determined to be optical duplicates. That is, picard gives the number of optical and total duplicates in their metric file, but it only marks in its output bam file which reads are duplicates (optical or otherwise), and not which of those are optical duplicates.
I was requested to find a way to remove the optical duplicates and the only way I can think of doing this is to change the picard source code itself to output such a flag into the bam file (which could then be used to easily filter on). Before embarking on this task however, I'd like to know if someone has already done this and would this person be so kind as to share the code to spare me the trouble?
Thanks a lot!!!
Anna
Unfortunately, I expect you'll need to tweak the source code. Since the BAM spec doesn't even distinguish between optical and PCR duplicates I doubt anyone considered someone wanting to do this. Out of curiosity, what sort of situation caused you to need this?
I'm not sure, actually, because I don't have access to the science problem in this instance!! I work with someone who does sequencing for a fee, and one of her clients from France would like to remove the optical duplicates from the sequences she generated because they're excessive (without the optical duplicates the ENCODE metric 'pcr bottlenecking' would be in the desired 'mild bottlenecking' range but with the optical duplicates they're in the undesirable 'moderate bottlenecking' range). I'm just helping her out.
You'll will have to sort the read by names, create a group with the reads in the same group/lane/tile having the same position on the genome and the same cigar string, extract the X/Y positions of the reads on the flow-cell (http://en.wikipedia.org/wiki/FASTQ_format#Illumina_sequence_identifiers), retrieve those too close together and keep the one with the highest quality.
Thanks, Pierre! That's probably a better solution than to customize MarkDuplicates!
But MarkDuplicates invokes the code doing that job: see line 50 https://github.com/broadinstitute/picard/blob/master/src/java/picard/sam/AbstractDuplicateFindingAlgorithm.java . May be you'll 'just' have to subclass that class to do the job. You're problem is to disable the code removing the PCR dups.
You've been so helpful, Pierre, one last question... would the distance be euclidean:
sqrt((x2-x1)**2 + (y2-y1)**2) <= 100
(OPTICAL_DUPLICATE_PIXEL_DISTANCE
)?That would be a simple script. My customer would be very happy!! Thanks again!
see https://github.com/broadinstitute/picard/blob/master/src/java/picard/sam/AbstractDuplicateFindingAlgorithm.java#L176
Pierre, I'm CONVINCED that the MarkDuplicates code is wrong based on your description!!
When I wrote a script based on your description, I got only ~1k optical duplicates as opposed to MarkDuplicates ~3 million. I then wrote a very simple script that simply counts how many duplicates share the same tile, and that was < 4k. The overall number of duplicates matched (~7 million). I'm convinced my script is right, it's such a simple one. I'm willing to share my code, is there a way to attach a file here?
http://gist.github.com ?
I'm attaching a remove optical duplicates script, and a count tile duplicates script.
Both take as input a sam file sorted on chr and startPos. They also assume that when the sequence name is parsed by
:
then the tile is the 5th field, x the 6th and y the 7th (e.g.HWI-ST1318:119:H89A3ADXX:1:2209:1705:6933
, wheretile
is '2209',x
is '1705' andy
is '6933'). Finally, they assume that the file is for a single lane, as I was working with such files.The remove optical duplicates code here:
The count tile duplicates code is here: