Is there an efficient utility for collapsing BAM files by sequence? That is, one that keeps only one read for each distinct sequence, ideally with some control over which read is kept when multiple reads have identical sequences but different quality scores. Thanks.
To clarify, I'd like to remove duplicates only if their sequences are identical, so reads with the same alignment position should be kept if they have distinct sequences.
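For what it's worth, here is a minimal sketch of the logic I have in mind, not an existing tool. It assumes the reads have already been parsed into simple records; the `Read` type, its field names, and the choice to keep the read with the highest mean base quality are all my own assumptions for illustration. In practice the records would come from a BAM reader rather than being built by hand.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Read(NamedTuple):
    # Hypothetical record type; field names are illustrative only.
    name: str
    ref: str                 # reference/contig name
    pos: int                 # alignment start position
    reverse: bool            # True if aligned to the reverse strand
    seq: str                 # read sequence
    quals: Tuple[int, ...]   # per-base quality scores


def mean_quality(read: Read) -> float:
    """Mean base quality; 0.0 for a read with no quality values."""
    return sum(read.quals) / len(read.quals) if read.quals else 0.0


def collapse_by_sequence(reads: Iterable[Read]) -> List[Read]:
    """Keep one read per (ref, pos, strand, sequence).

    Reads at the same position with different sequences are all kept.
    Among exact sequence duplicates, the read with the highest mean
    quality wins (an assumed tie-breaking rule; the first seen is kept
    on ties).
    """
    best = {}
    for read in reads:
        key = (read.ref, read.pos, read.reverse, read.seq)
        kept = best.get(key)
        if kept is None or mean_quality(read) > mean_quality(kept):
            best[key] = read
    return list(best.values())


if __name__ == "__main__":
    reads = [
        Read("r1", "chr1", 100, False, "ACGT", (30, 30, 30, 30)),
        Read("r2", "chr1", 100, False, "ACGT", (40, 40, 40, 40)),
        Read("r3", "chr1", 100, False, "ACGA", (20, 20, 20, 20)),
    ]
    for read in collapse_by_sequence(reads):
        print(read.name, read.seq)
```

This holds one read per distinct key in memory, so on a coordinate-sorted file it would make more sense to flush the dictionary each time the position changes. Dropping `ref`, `pos`, and `reverse` from the key would collapse purely by sequence regardless of where the read aligns.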
The downside of using this is that the BAM files it generates could only be used with GATK's tools. And, correct me if I'm wrong, but I thought this was part of the GATK v2 code that isn't open source.
At least there is a specification.