The GATK team marks duplicates at the end of their pipeline, after merging the BAMs.
To remove the optical duplicates lane by lane, I would have put this step right after the BWA alignment of each lane/sample (parallelization, so faster).
Is there any reason to mark the duplicates at this position in their pipeline?
Optical duplicate = two spots, close to each other on the flowcell, mapping to the same fragment.
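To make that definition concrete, here is a minimal sketch of how two reads already known to map to the same fragment could be tested for being "close to each other". It assumes colon-separated Illumina-style read names of the form `instrument:run:flowcell:lane:tile:x:y`; the names `parse_spot` and `is_optical_pair` are hypothetical, and the 100-pixel threshold is only an illustrative default (to my knowledge it matches Picard MarkDuplicates' `OPTICAL_DUPLICATE_PIXEL_DISTANCE` default, but check your version).

```python
from typing import NamedTuple


class Spot(NamedTuple):
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int


def parse_spot(read_name: str) -> Spot:
    """Extract the physical location of a cluster from its read name.

    Assumed layout (hypothetical for your data): instrument:run:flowcell:lane:tile:x:y
    """
    fields = read_name.split(":")
    flowcell = fields[2]
    lane, tile, x, y = (int(f) for f in fields[3:7])
    return Spot(flowcell, lane, tile, x, y)


def is_optical_pair(name_a: str, name_b: str, max_pixel_distance: int = 100) -> bool:
    """Return True if two duplicate reads sit on the same tile within the pixel threshold.

    The caller is expected to pass only reads that already map to the same fragment.
    """
    a, b = parse_spot(name_a), parse_spot(name_b)
    # Spots on different flowcells, lanes, or tiles cannot be optical duplicates.
    if (a.flowcell, a.lane, a.tile) != (b.flowcell, b.lane, b.tile):
        return False
    # Both coordinates must be within the threshold (a box, not a circle).
    return abs(a.x - b.x) <= max_pixel_distance and abs(a.y - b.y) <= max_pixel_distance
```

Because the test requires the same lane and tile, optical duplicates can indeed be found lane by lane, which is the point made in the question.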
OK, then those should be very few. PCR duplicates can be far more numerous and more serious. We have had some libraries with up to 70% PCR duplicates; clearly not good libraries.
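A rough sketch of why PCR duplicates argue for marking after the merge, under the assumption (not stated above) that the same library was sequenced on more than one lane. Duplicates are grouped here by a simplified key of library, chromosome, position and strand; real tools use more careful keys, and the toy `reads` list and the `count_duplicates` helper are made up for illustration.

```python
from collections import defaultdict


def count_duplicates(reads):
    """Count reads beyond the first in each (library, chrom, pos, strand) group."""
    groups = defaultdict(int)
    for r in reads:
        groups[(r["library"], r["chrom"], r["pos"], r["strand"])] += 1
    return sum(n - 1 for n in groups.values())


# Toy data: one library (libA) spread over two lanes.
reads = [
    {"lane": 1, "library": "libA", "chrom": "chr1", "pos": 1000, "strand": "+"},
    {"lane": 2, "library": "libA", "chrom": "chr1", "pos": 1000, "strand": "+"},
    {"lane": 2, "library": "libA", "chrom": "chr1", "pos": 1000, "strand": "+"},
    {"lane": 1, "library": "libA", "chrom": "chr1", "pos": 5000, "strand": "-"},
]

# Marking each lane separately only sees duplicates within that lane.
per_lane = sum(
    count_duplicates([r for r in reads if r["lane"] == lane]) for lane in (1, 2)
)

# Marking after the merge also catches copies of a fragment that landed on different lanes.
merged = count_duplicates(reads)

print(per_lane, merged)
```

In this toy case the per-lane pass finds fewer duplicates than the merged pass, because one copy of the fragment at chr1:1000 sits alone on lane 1.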
OK, PCR duplicates are a good argument.