Question

How would you filter PCR duplicates for merged paired-end reads?

6.5 years ago

hohoku • 0

I have some paired-end sequencing data that has a significant number of pairs that overlap due to small insert sizes. In my experience, merging the read pairs (and recalibrating with bbmap) results in better alignments. However, when it comes to the PCR duplicate removal step of the merged read pairs, I want to identify those alignments where both the 5' and 3' ends are identical, and none of the commonly used tools (samtools, sambamba, picard) appears to have this feature.

My question is, how would you filter for these overlapping read pairs that have been merged prior to alignment? I don't want to treat them as single-end reads for PCR duplicate removal as I may end up discarding information unnecessarily.

alignment PCR duplicate BAM • 3.7k views

6.5 years ago

finswimmer 16k

Hello,

there are several approaches to identifying PCR duplicates in paired-end sequencing:

compare the 5' mapping positions of the read pairs
compare the most 5' mapping positions of the read pairs, taking clipped bases into account
compare the sequences of the read pairs

In my experience the results are more or less the same.
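As a rough illustration of the third approach in the list above (comparing sequences), here is a minimal Python sketch. It assumes the read pairs are already in memory as `(name, seq1, seq2)` tuples, and the function name `dedupe_pairs_by_sequence` is made up for this example. It only catches exact matches, so it is not a substitute for a dedicated tool.

```python
def dedupe_pairs_by_sequence(pairs):
    """Keep the first read pair seen for each exact (seq1, seq2) combination.

    pairs: iterable of (name, seq1, seq2) tuples (hypothetical input layout).
    """
    seen = set()
    kept = []
    for name, seq1, seq2 in pairs:
        key = (seq1, seq2)
        if key in seen:
            # Same sequences on both mates: treat as a duplicate and skip it.
            continue
        seen.add(key)
        kept.append((name, seq1, seq2))
    return kept


pairs = [
    ("a", "ACGT", "TTGA"),
    ("b", "ACGT", "TTGA"),  # identical to pair "a" on both mates
    ("c", "ACGT", "TTGC"),  # mate 2 differs, so this pair is kept
]
kept_pairs = dedupe_pairs_by_sequence(pairs)
```

A single sequencing error is enough for a true duplicate to slip past an exact comparison like this one.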

When working with merged overlapping reads, the problem is that the alignment contains a mixture of paired and single reads. This is why I prefer removing duplicates based on their sequence prior to merging the reads. A tool that can do this is clumpify.sh from bbtools:

$ clumpify.sh in=in_R1.fastq.gz in2=in_R2.fastq.gz out=out_R1.fastq.gz out2=out_R2.fastq.gz dedupe

fin swimmer

In my experience the results are more or less the same

Definitely agreed. Therefore, I strongly recommend choosing whichever approach/tool you feel most comfortable with and then proceeding with the analysis, rather than spending more time on the duplicate issue.

ADD REPLY • link 6.5 years ago by ATpoint 87k

I see your point, but I'm still interested in how to solve this problem. It only requires identifying alignments with identical 5' and 3' ends, so I thought someone here might know a neat way to do it.
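For what it's worth, the idea of "identical 5' and 3' ends" can be sketched in plain Python on SAM text records. This is only a sketch under assumptions: the input is single-end style alignments of merged reads, the key is (reference, strand, clipping-adjusted start, clipping-adjusted end), and the record with the highest summed base quality is kept. The function names `unclipped_span` and `dedupe_merged` are invented here, unmapped reads are simply dropped, and secondary/supplementary alignments get no special handling.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def unclipped_span(pos, cigar):
    """Return (start, end), 1-based inclusive, extended over clipped bases."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    start = pos
    for n, op in ops:  # leading soft/hard clips move the start left
        if op in "SH":
            start -= n
        else:
            break
    # Only operations that consume the reference advance the end position.
    end = pos - 1 + sum(n for n, op in ops if op in "MDN=X")
    for n, op in reversed(ops):  # trailing clips move the end right
        if op in "SH":
            end += n
        else:
            break
    return start, end


def dedupe_merged(sam_lines):
    """Keep one record per (reference, strand, start, end) key."""
    best = {}
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        f = line.rstrip("\n").split("\t")
        flag, rname, pos, cigar, qual = int(f[1]), f[2], int(f[3]), f[5], f[10]
        if flag & 0x4 or cigar == "*":  # unmapped reads are dropped in this sketch
            continue
        strand = "-" if flag & 0x10 else "+"
        key = (rname, strand) + unclipped_span(pos, cigar)
        score = sum(ord(c) - 33 for c in qual)  # assumes Phred+33 qualities
        if key not in best or score > best[key][0]:
            best[key] = (score, line)
    return [line for _, line in best.values()]


def _rec(name, pos, cigar, qchar, length=50):
    # Helper that builds a minimal 11-column SAM record for the demo below.
    return "\t".join([name, "0", "chr1", str(pos), "60", cigar, "*", "0", "0",
                      "A" * length, qchar * length])


records = [
    _rec("r1", 100, "50M", "I"),
    _rec("r2", 100, "50M", "5"),        # same ends as r1, lower quality
    _rec("r3", 100, "60M", "I", 60),    # same 5' end, different 3' end
    _rec("r4", 102, "2S48M", "I"),      # same ends as r1 once clipping is undone
]
kept = [line.split("\t")[0] for line in dedupe_merged(records)]
```

Here r3 survives because only its 5' end matches r1, which is exactly the case that single-end duplicate removal would discard.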

ADD REPLY • link 6.5 years ago by hohoku • 0

Yes, I am aware of clumpify and do like it. However, we sometimes sequence more of the same sample at a later date, often substantially more. In that case it is more practical to just merge the BAM files from all the sequencing lanes than to go back to the FASTQ files, merge them all, deduplicate, and map again.

ADD REPLY • link 6.5 years ago by hohoku • 0

but sometimes when we sequence more of our sample at a later date

From the same library as in the first sequencing run? Otherwise removing duplicates after merging wouldn't be correct.

fin swimmer

ADD REPLY • link 6.5 years ago by finswimmer 16k

Yes, same library... I'm not that bad

ADD REPLY • link 6.5 years ago by hohoku • 0

score 1 · Accepted Answer · 2018-11-12

6.5 years ago

hohoku • 0

I found that PALEOMIX has exactly the tool I was looking for. It reads the sorted BAM from standard input and writes the result to standard output:

paleomix rmdup_collapsed --remove-duplicates < sorted.bam > out.bam
