I have some paired-end sequencing data that has a significant number of pairs that overlap due to small insert sizes. In my experience, merging the read pairs (and recalibrating with bbmap) results in better alignments. However, when it comes to the PCR duplicate removal step of the merged read pairs, I want to identify those alignments where both the 5' and 3' ends are identical, and none of the commonly used tools (samtools, sambamba, picard) appears to have this feature.
My question is, how would you filter for these overlapping read pairs that have been merged prior to alignment? I don't want to treat them as single-end reads for PCR duplicate removal as I may end up discarding information unnecessarily.
Definitely agreed. I therefore strongly recommend choosing the approach or tool you feel most comfortable with and proceeding with the analysis, rather than spending more time on the duplicate issue.
I see your point, but I'm still interested in how to solve this problem. It only requires identifying alignments with identical 5' and 3' ends, so I thought someone here might know a neat way to do it.
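For what it's worth, here is a minimal sketch of that idea in plain Python, assuming the merged reads were aligned as single-end reads and are available as SAM text (for example via `samtools view -h`). It groups alignments by reference name, strand, and both end coordinates, and keeps the record with the highest summed base quality in each group. The function names `unclipped_span` and `dedup_merged` are hypothetical, as are two further choices: extending the ends over clipped bases, and passing unmapped reads through untouched. It makes no special case for secondary or supplementary alignments, so treat it as a starting point rather than a tested tool.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def unclipped_span(pos, cigar):
    """Return (start, end), 1-based inclusive, extended over leading/trailing clips."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    start, end = pos, pos + ref_len - 1
    # Assumption: clipped bases count towards the fragment ends, so that
    # duplicates which were clipped differently still share one key.
    for n, op in ops:
        if op not in "SH":
            break
        start -= n
    for n, op in reversed(ops):
        if op not in "SH":
            break
        end += n
    return start, end


def dedup_merged(sam_lines):
    """Keep one alignment per (reference, start, end, strand) key."""
    headers, passthrough, best = [], [], {}
    for idx, line in enumerate(sam_lines):
        line = line.rstrip("\n")
        if line.startswith("@"):  # header line
            headers.append(line)
            continue
        fields = line.split("\t")
        flag = int(fields[1])
        if flag & 0x4:  # unmapped: no coordinates to compare, keep as-is
            passthrough.append((idx, line))
            continue
        start, end = unclipped_span(int(fields[3]), fields[5])
        key = (fields[2], start, end, bool(flag & 0x10))
        qual = fields[10]
        score = 0 if qual == "*" else sum(ord(c) - 33 for c in qual)
        # Keep the highest-scoring record; ties go to the first one seen.
        if key not in best or score > best[key][0]:
            best[key] = (score, idx, line)
    kept = passthrough + [(i, rec) for _, i, rec in best.values()]
    return headers + [rec for _, rec in sorted(kept)]
```

Two merged reads that start at the same position but end at different ones get different keys here, so both are kept. That is the difference from single-end duplicate marking, which only looks at the 5' end.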
Yes, I am aware of clumpify and do like it. But sometimes we sequence more of our sample at a later date, often substantially more, and then it is more practical to just merge the BAM files from all the sequencing lanes than to go back to the FASTQs, merge them all, deduplicate and map again.
From the same library as in the first sequencing run? Otherwise removing duplicates after merging wouldn't be correct.
fin swimmer
Yes, same library... I'm not that bad