How would you filter PCR duplicates for merged paired-end reads?
6.1 years ago
hohoku • 0

I have some paired-end sequencing data that has a significant number of pairs that overlap due to small insert sizes. In my experience, merging the read pairs (and recalibrating with bbmap) results in better alignments. However, when it comes to the PCR duplicate removal step of the merged read pairs, I want to identify those alignments where both the 5' and 3' ends are identical, and none of the commonly used tools (samtools, sambamba, picard) appears to have this feature.

My question is, how would you filter for these overlapping read pairs that have been merged prior to alignment? I don't want to treat them as single-end reads for PCR duplicate removal as I may end up discarding information unnecessarily.

alignment PCR duplicate BAM • 3.4k views
6.1 years ago
hohoku • 0

I found that the software Paleomix has the exact tool I was looking for.

paleomix rmdup_collapsed --remove-duplicates < sorted.bam > out.bam

6.1 years ago

Hello,

There are several approaches to identifying PCR duplicates in paired-end sequencing:

  • compare the 5' mapping positions of the read pairs
  • compare the most 5' mapping positions of the read pairs, taking clipped bases into account
  • compare the sequences of the read pairs

In my experience the results are more or less the same.

When working with merged overlapping reads, the problem is that the alignment contains a mixture of paired and single reads. This is why I prefer removing duplicates based on their sequence before merging the reads. One tool that can do this is clumpify.sh from BBTools:

$ clumpify.sh in=in_R1.fastq.gz in2=in_R2.fastq.gz out=out_R1.fastq.gz out2=out_R2.fastq.gz dedupe

fin swimmer
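
To make the sequence-based approach above concrete, here is a minimal Python sketch that keeps the first occurrence of each (R1 sequence, R2 sequence) pair. It is not how clumpify.sh works internally; the function names `read_fastq` and `dedup_pairs` are made up for illustration, and it assumes plain 4-line FASTQ records with R1 and R2 in the same order. It uses exact string matching only, so duplicates that differ by a sequencing error are not caught.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, quality) from 4-line-per-record FASTQ lines."""
    it = iter(lines)
    while True:
        rec = [line.rstrip("\n") for line in islice(it, 4)]
        if len(rec) < 4:
            return
        yield rec[0], rec[1], rec[3]


def dedup_pairs(r1_lines, r2_lines):
    """Yield read pairs, skipping any pair whose two sequences were seen before."""
    seen = set()
    for rec1, rec2 in zip(read_fastq(r1_lines), read_fastq(r2_lines)):
        key = (rec1[1], rec2[1])  # both mates must match exactly
        if key in seen:
            continue
        seen.add(key)
        yield rec1, rec2
```

Both arguments can be open file handles or any iterable of lines. The set of seen sequences is held in memory, so this is only practical for modest data sets.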


In my experience the results are more or less the same

Definitely agreed. I therefore strongly recommend choosing whichever approach/tool you feel most comfortable with and then proceeding with the analysis, rather than spending more time on the duplicate issue.


I see your point, but I'm still interested in how to solve this problem. It only requires identifying alignments with identical 5' and 3' ends, so I thought someone here might know a neat way to do it.
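
For illustration, the idea of matching both ends can be sketched in plain Python. This is a rough sketch under several assumptions, not how any existing tool implements it: the input is SAM text (for example from `samtools view -h`), every mapped record is a merged read aligned as single-end, base qualities are Phred+33, and secondary/supplementary alignments are not treated specially. The function names (`unclipped_span`, `mean_quality`, `dedup_merged`) are hypothetical. Records are grouped by reference, strand, and clipping-adjusted start and end, and the record with the highest mean base quality in each group is kept.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def unclipped_span(pos, cigar):
    """Return 1-based (start, end) on the reference, extended over clipped bases."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    start = pos
    for n, op in ops:  # leading soft/hard clips
        if op not in "SH":
            break
        start -= n
    end = pos + sum(n for n, op in ops if op in REF_CONSUMING) - 1
    for n, op in reversed(ops):  # trailing soft/hard clips
        if op not in "SH":
            break
        end += n
    return start, end


def mean_quality(qual):
    """Mean Phred score of a Phred+33 quality string (0.0 if absent)."""
    if not qual or qual == "*":
        return 0.0
    return sum(ord(c) - 33 for c in qual) / len(qual)


def dedup_merged(sam_lines):
    """Keep one record per (reference, strand, start, end); input order is preserved."""
    lines = [line.rstrip("\n") for line in sam_lines if line.strip()]
    keep = set()
    best = {}  # key -> (score, line index)
    for i, line in enumerate(lines):
        if line.startswith("@"):  # header lines pass through
            keep.add(i)
            continue
        fields = line.split("\t")
        flag = int(fields[1])
        if flag & 0x4 or fields[5] == "*":  # unmapped reads pass through
            keep.add(i)
            continue
        start, end = unclipped_span(int(fields[3]), fields[5])
        key = (fields[2], bool(flag & 0x10), start, end)
        score = mean_quality(fields[10])
        if key not in best or score > best[key][0]:
            best[key] = (score, i)
    keep.update(i for _, i in best.values())
    return [line for i, line in enumerate(lines) if i in keep]
```

Because the key includes both ends, two merged reads that share a 5' position but differ in length are not collapsed, which is the behaviour asked for above.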


Yes, I am aware of clumpify and do like it, but sometimes when we sequence more of our sample at a later date, often substantially more, it is more practical to just merge the BAM files from all sequencing lanes than to go back to the FASTQs, merge them all, deduplicate, and map again.


but sometimes when we sequence more of our sample at a later date

From the same library as in the first sequencing run? Otherwise removing duplicates after merging wouldn't be correct.

fin swimmer


Yes, same library... I'm not that bad
