Remove optical duplicates from bam file

Entering edit mode

10.9 years ago

Anna S ▴ 520

Hello,

After searching in this user group and analyzing the output of picard MarkDuplicates, it seems that picard does not output which of the reads it determined to be optical duplicates. That is, picard gives the number of optical and total duplicates in their metric file, but it only marks in its output bam file which reads are duplicates (optical or otherwise), and not which of those are optical duplicates.

I was requested to find a way to remove the optical duplicates and the only way I can think of doing this is to change the picard source code itself to output such a flag into the bam file (which could then be used to easily filter on). Before embarking on this task however, I'd like to know if someone has already done this and would this person be so kind as to share the code to spare me the trouble?

Thanks a lot!!!

Anna

sequencing • 14k views

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.9 years ago by Anna S ▴ 520

Entering edit mode

Unfortunately, I expect you'll need to tweak the source code. Since the BAM spec doesn't even distinguish between optical and PCR duplicates I doubt anyone considered someone wanting to do this. Out of curiosity, what sort of situation caused you to need this?

ADD REPLY • link updated 4.4 years ago by Ram 45k • written 10.9 years ago by Devon Ryan 105k

Entering edit mode

I'm not sure, actually, because I don't have access to the science problem in this instance!! I work with someone who does sequencing for a fee, and one of her clients from France would like to remove the optical duplicates from the sequences she generated because they're excessive (without the optical duplicates the ENCODE metric 'pcr bottlenecking' would be in the desired 'mild bottlenecking' range but with the optical duplicates they're in the undesirable 'moderate bottlenecking' range). I'm just helping her out.

ADD REPLY • link updated 4.4 years ago by Ram 45k • written 10.9 years ago by Anna S ▴ 520

Entering edit mode

You'll will have to sort the read by names, create a group with the reads in the same group/lane/tile having the same position on the genome and the same cigar string, extract the X/Y positions of the reads on the flow-cell (http://en.wikipedia.org/wiki/FASTQ_format#Illumina_sequence_identifiers), retrieve those too close together and keep the one with the highest quality.

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by Pierre Lindenbaum 166k

Entering edit mode

Thanks, Pierre! That's probably a better solution than to customize MarkDuplicates!

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by Anna S ▴ 520

Entering edit mode

But MarkDuplicates invokes the code doing that job: see line 50 https://github.com/broadinstitute/picard/blob/master/src/java/picard/sam/AbstractDuplicateFindingAlgorithm.java . May be you'll 'just' have to subclass that class to do the job. You're problem is to disable the code removing the PCR dups.

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by Pierre Lindenbaum 166k

Entering edit mode

You've been so helpful, Pierre, one last question... would the distance be euclidean: sqrt((x2-x1)**2 + (y2-y1)**2) <= 100 (OPTICAL_DUPLICATE_PIXEL_DISTANCE)?

That would be a simple script. My customer would be very happy!! Thanks again!

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by Anna S ▴ 520

Entering edit mode

see https://github.com/broadinstitute/picard/blob/master/src/java/picard/sam/AbstractDuplicateFindingAlgorithm.java#L176

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by Pierre Lindenbaum 166k

Entering edit mode

Pierre, I'm CONVINCED that the MarkDuplicates code is wrong based on your description!!

When I wrote a script based on your description, I got only ~1k optical duplicates as opposed to MarkDuplicates ~3 million. I then wrote a very simple script that simply counts how many duplicates share the same tile, and that was < 4k. The overall number of duplicates matched (~7 million). I'm convinced my script is right, it's such a simple one. I'm willing to share my code, is there a way to attach a file here?

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by Anna S ▴ 520

Entering edit mode

http://gist.github.com ?

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by Pierre Lindenbaum 166k

Entering edit mode

I'm attaching a remove optical duplicates script, and a count tile duplicates script.

Both take as input a sam file sorted on chr and startPos. They also assume that when the sequence name is parsed by : then the tile is the 5th field, x the 6th and y the 7th (e.g. HWI-ST1318:119:H89A3ADXX:1:2209:1705:6933, where tile is '2209', x is '1705' and y is '6933'). Finally, they assume that the file is for a single lane, as I was working with such files.

The remove optical duplicates code here:

	#!/usr/local/bin/python

	"RemoveOpticalDuplicates.py"
	"Author: Anna Salzberg"

	import sys
	import string
	import os
	import time
	import math

	def convertStr(s):
	"""Convert string to either int or float."""
	try:
	ret = int(s)
	except ValueError:
	try :
	ret = float(s);
	except ValueError :
	ret = -1;

	return ret

	class Parameter :
	def __init__(self, argv) :
	self.argv = argv;
	self.inFN = "";
	self.opticalDuplicatePixelDistance = 100;

	def getOpticalDuplicatePixelDistance(self) :
	return self.opticalDuplicatePixelDistance

	def getInFN(self) :
	return self.inFN;

	def getOutFN(self) :
	return self.outFN;

	# Prints usage
	def usage(self):
	print "Usage: python RemoveOpticalDuplicates.py [-d optical_duplciate_pixel_distance] <input file> "
	print "where the input is a sam file sorted by chr position. Default optical_duplciate_pixel_distance: 100"
	print "Example: python RemoveOpticalDuplicates.py sampleA.bwa.sorted.sam > sampleA.bwa.rmoptdup.sam"

	sys.exit(1)

	def checkArgs(self) :
	i = 1;
	argc = len(sys.argv);
	errorMsg = "";
	while i < argc and errorMsg == "":
	if self.argv[i] == '-d' :
	i += 1;
	if i < argc and not self.argv[i].startswith('-') :
	self.opticalDuplicatePixelDistance = convertStr(self.argv[i]);
	if self.opticalDuplicatePixelDistance < 1 :
	errorMsg = 'Error: optical duplicate pixel distance must be an integer >= 1'
	else :
	errorMsg = 'Error: optical duplicate pixel distance expected after -d flag'
	elif i == argc-1 :
	self.inFN = self.argv[i];
	else :
	errorMsg = 'Error: unknown flag: ' + self.argv[i];

	i += 1;

	if errorMsg == "" :
	if self.inFN == "" :
	errorMsg = 'Error: insufficient arguments given'

	if errorMsg <> "" :
	print errorMsg;
	self.usage();



	def removeOpticalDuplicates(param) :
	inFN = param.getInFN();
	outFile = sys.stdout;
	optDist = param.getOpticalDuplicatePixelDistance()

	try:
	inFile = open (inFN, 'r')
	except:
	print "Unable to open file " + inFN
	sys.exit(-1)


	tile_cigarToDupDict = {}
	curChr = ""
	curStartPos = ""

	first = True;
	line = inFile.readline();
	while (1) :
	if not line: break
	nextLine = inFile.readline();
	if not line.startswith("@") :
	tokens = line.split();
	chr = tokens[2];
	startPos = tokens[3];
	mapq = tokens[4]
	cigar = tokens[5];
	subtokens = tokens[0].split(":");
	tile = subtokens[4];
	x = subtokens[5]
	y = subtokens[6]
	tile_cigar = tile + "_" + cigar;
	if first :
	curChr = chr;
	curStartPos = startPos
	first = False;

	if chr <> "*" and (chr == curChr and startPos == curStartPos) :
	dups = tile_cigarToDupDict.get(tile_cigar, []);
	dups.append([x, y, mapq, line]);
	tile_cigarToDupDict[tile_cigar] = dups;

	if chr <> "*" and (chr <> curChr or startPos <> curStartPos or not nextLine) :
	for key in tile_cigarToDupDict.keys() :
	dups = tile_cigarToDupDict[key];
	dups.sort();
	if len(dups) == 1 :
	nondupline = dups[0][3]
	outFile.write(nondupline);
	elif len(dups) > 1 :
	processed = []
	for k in range(0, len(dups)) :
	processed.append(False)

	for i in range(0, len(dups)) :
	if not processed[i] :
	xi = convertStr(dups[i][0])
	yi = convertStr(dups[i][1])
	mapqi = convertStr(dups[i][2])
	processed[i] = True;

	best = i;
	bestMapq = mapqi;
	for j in range(i+1, len(dups)) :
	if not processed[j] :
	xj = convertStr(dups[j][0])
	yj = convertStr(dups[j][1])
	mapqj = convertStr(dups[j][2])

	if abs(xj-xi) > optDist : break;
	if abs(yj-yi) <= optDist :
	processed[j] = True
	if mapqj > bestMapq :
	best = j;
	bestMapq = mapqj;
	bestline = dups[best][3]
	outFile.write(bestline);
	tile_cigarToDupDict = {}
	tile_cigarToDupDict[tile_cigar] = [[x, y, mapq, line]]

	if chr <> "*" and not nextLine and (chr <> curChr or startPos <> curStartPos) :
	outFile.write(line);
	if chr == "*" :
	outFile.write(line);
	curChr = chr
	curStartPos = startPos;

	line = nextLine;

	inFile.close();
	outFile.close();


	# Execute as application
	if __name__ == '__main__' :
	param = Parameter(sys.argv)
	param.checkArgs();

	removeOpticalDuplicates(param)

view raw removeOptDups.py hosted with ❤ by GitHub

The count tile duplicates code is here:

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by Anna S ▴ 520

Entering edit mode

10.9 years ago

Anna S ▴ 520

I contacted the Broad Institute, and Nils Homer (samtools team) informed me that they already knew about one bug relating to this. I tested the fix, and this bug affected not only the number of optical duplicates but also the estimated library size statistic. To get the bug fix you need version >= Picard 1.120.

Unfortunately I still think there are additional bug(s), and I'm working with Nils on this. I'll let you know what else I find out!!

ADD COMMENT • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by Anna S ▴ 520

Entering edit mode

I have found another bug with MarkDuplicates!! I suspect there's more too but I've had a gazillion deadlines in the past several days and so I haven't looked into it yet; perhaps next week sometime.

In particular, I ran bwa and then picard MarkDuplicates v1.122 (which incorporates the fix above) on lane 1 and lane 2 files separately, and I then did the same for the merged fastq files. I got the picard MarkDuplicates results below. Note that the overall % duplication is higher in the merged file than in the individual lane ones (to be expected), but that the OPTICAL duplicates is also higher, that is, it is more than the sum of the OPTICAL duplicates of lanes 1 and 2 (NOT to be expected).

In other words, it is expected that the overall duplication is higher than the sum of the ones in lanes 1 and 2 due to interaction between the reads of lanes 1 and 2 (for ex., read1 from lane1 could have no duplicates in lane1, and read2 from lane2 could have no duplicate in lane2 but in the merged file they could be duplicates of each other). However, by definition, there should be no interaction in the OPTICAL duplicates between lanes 1 and 2, and therefore the results below suggest that the MarkDuplicates optical duplication algorithm is not checking for the lane.

Sample        LIBRARY               UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED    UNMAPPED_READS    UNPAIRED_READ_DUPLICATES    READ_PAIR_DUPLICATES   READ_PAIR_OPTICAL_DUPLICATES    PERCENT_DUPLICATION    ESTIMATED_LIBRARY_SIZE
A_merged      Unknown Library       497938                  21072701               939669            326848                      9821728                3875890                         0.46831                18726206
A_L1          Unknown Library       247592                  10518632               458565            141958                      3109570                1074927                         0.298856               18640119
A_L2          Unknown Library       250347                  10554068               481105            147968                      3130326                1080576                         0.30005                18605087

ADD REPLY • link updated 4.4 years ago by Ram 45k • written 10.9 years ago by Anna S ▴ 520

Entering edit mode

I have heard back from Nils Homer regarding the lane issue above, and his answer is that MarkDuplicates requires AddOrReplaceReadGroups (or equivalent) to work properly since it extracts the tile,x and y from the name but not the lane. Here is his full answer:

Thanks Anna for the example set. I have observed a few things regarding this issue

The first is that we do not extract the lane # from the read name, only tile, x-coordinate, and y-coordinate. You can see this in the code here if you are interested: https://github.com/broadinstitute/picard/blob/master/src/java/picard/sam/markduplicates/util/OpticalDuplicateFinder.java#L84-L104

Secondly, we also do not retrieve either the barcode information or library identifier in the read name, since they themselves are not embedded in the read name. Both barcode and library identifier are also important to condition upon when searching for optical duplicates, or duplicates in general.

This brings us to where do we expect to retrieve this information? We use the read group header lines to capture lane, barcode, library, flowcell (for Illumina) and other information for specific sets or groups of reads. If this information is given, which I recommend that as a best practice it should, MarkDuplicates will behave as you expect. I believe it is much more robust to annotate these metadata in the header rather than rely on parsing read names wholly, since read name structures do change, albeit infrequently.

I would recommend adding read groups to your SAM header within your pipeline. We use FastqToSam or IlluminaBasecallsToSam to set the read group appropriately depending on our inputs. In Picard, we also have tools like AddOrReplaceReadGroups that can help you add read groups prior to marking duplicates.

Nils

ADD REPLY • link updated 4.4 years ago by Ram 45k • written 10.9 years ago by Anna S ▴ 520

Entering edit mode

Thanks Anna for the great detective work and sharing back what you have learned.

You are like The Little Engine That Could - keep on pushing the issue until it is clarified!

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by Istvan Albert 103k

Entering edit mode

Istvan, I wish I were even more so because I'm convinced I have found yet another bug in their optical duplicates where their number is too high (higher even than the duplicates with the same tile!), but it's awkward for me to report yet a third problem to Dr. Homer. I think at this point I'd probably have to look at their code and pinpoint the problem, sigh.

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by Anna S ▴ 520

Entering edit mode

I am sure they want to know as well - and that they would not want to knowingly run code that is not right.

ADD REPLY • link 10.9 years ago by Istvan Albert 103k

Entering edit mode

I have just counted a small example BY HAND, and of course it matched my script (!, haha). When I count the number of duplicates in the same tile for a small example, I get a total of 12. However, picard MarkDuplicates finds 25 optical duplicate pairs, that is 50. (the total number of duplicates from picard is correct).

Ok guys, I have just sent NIls Homer this... I'll let you know what he says.

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by Anna S ▴ 520

Entering edit mode

Thanks for all your work, Anna! Sorry to bug you but is there any news on this? Can we now trust MarkDuplicates in, for example, version 2.01?

ADD REPLY • link 9.5 years ago by terdon ▴ 430

Entering edit mode

Yes - I'm just about to fire off MarkDuplicates but if it's wrong, then I might have a problem :P

Obviously we want to mark optical duplicates and PCR duplicates as different somehow (new tag?), or maybe repurpose the Vendor Failed QC flag, since that never seems to get set anyway.

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 9.5 years ago by John 13k

Entering edit mode

Hi Anna，

Would you mind telling me where I can get the latest version of picard (may be the v1.122 you have mentioned)? The latest version of it on http://sourceforge.net/projects/picard/ was v1.119.

Thank you!

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by Emma ▴ 10

Entering edit mode

You can get newer versions here.

ADD REPLY • link updated 4.4 years ago by Ram 45k • written 10.4 years ago by Devon Ryan 105k

Entering edit mode

@Devon Ryan, thanks for the link ;)

ADD REPLY • link 10.4 years ago by Emma ▴ 10

Entering edit mode

The picard fix decreases the number of optical duplicates by 27-36%. As I said, I still think the number they report is too high though.

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by Anna S ▴ 520

Entering edit mode

10.9 years ago

Pierre Lindenbaum 166k

http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates

REMOVE_DUPLICATES=Boolean If true do not write duplicates to the output file instead of writing them with appropriate flags set. Default value: false.

ADD COMMENT • link updated 5.9 years ago by Ram 45k • written 10.9 years ago by Pierre Lindenbaum 166k

Entering edit mode

I think what you propose removes all duplicates, not just the optical duplicates? MarkDuplicates does work to determine which of the duplicates are optical but this has very limited value as it does not output into its bam file an optical flag. My customer would like to remove only the optical duplicates but leave the 'pcr' ones for analysis. Thanks if you have any tips for this!!

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by Anna S ▴ 520

Entering edit mode

OK just optical dups. sorry, I didn't understood your question at first sight.

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by Pierre Lindenbaum 166k