Samtools Dedup Documentation
12.1 years ago

Greetings,

A collaborator asked me for specific details about how I removed duplicate reads from a single-end library (after alignment).

I used samtools rmdup -s my.bam.

This level of detail wasn't acceptable to my collaborator. Where can I find the specific details on how samtools rmdup works?

I didn't find an answer at: http://samtools.sourceforge.net/samtools.shtml

Edit: The documentation at the link above was confusing to me, and I was just hoping for some clarification.

Any better documentation would be great!

12.1 years ago

For paired-end (PE) reads, rmdup is pretty straightforward. It looks for read pairs with identical external coordinates, meaning the 5' start coordinates of the two mates of a pair in FR orientation. Among pairs that share those coordinates, it keeps only the pair with the highest mapping quality.
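To make the paired-end rule concrete, here is a minimal Python sketch of the behaviour as described in this thread and in the manual. It is not the samtools source code: the `Pair` record, its field names, and the tie-breaking rule are assumptions made purely for illustration.

```python
from collections import namedtuple

# Hypothetical, simplified record for one properly paired FR read pair.
# left_start / right_end are the pair's external (outermost) coordinates.
Pair = namedtuple("Pair", "name chrom left_start right_end mapq")


def dedup_pairs(pairs):
    """Keep one pair per (chrom, left_start, right_end): the highest-MAPQ one."""
    best = {}
    for p in pairs:
        key = (p.chrom, p.left_start, p.right_end)
        # Strict '>' keeps the first pair seen on a MAPQ tie (an arbitrary choice).
        if key not in best or p.mapq > best[key].mapq:
            best[key] = p
    return sorted(best.values(), key=lambda p: (p.chrom, p.left_start, p.right_end))


pairs = [
    Pair("a", "chr1", 100, 350, 30),
    Pair("b", "chr1", 100, 350, 60),  # same external coordinates as "a"
    Pair("c", "chr1", 100, 360, 20),  # different right end, so not a duplicate
]
kept_pairs = dedup_pairs(pairs)
print([p.name for p in kept_pairs])  # ['b', 'c']
```

Pairs "a" and "b" share both external coordinates, so only the higher-MAPQ pair "b" survives; "c" differs at one end and is kept.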

For single-end (SE) reads, I've read that samtools likewise only looks for identical 5' start coordinates, not both start and end coordinates. I think the reasoning is that base quality usually drops towards the 3' end of a read, so after trimming and mapping, duplicate reads are more likely to differ at their 3' ends. It therefore uses only the (adapter-trimmed) 5' start to identify duplicates.
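The single-end case can be sketched the same way. This is a simplified illustration under stated assumptions, not the samtools implementation: the `Read` record is hypothetical, the 5' end of a reverse-strand read is taken to be its rightmost aligned position, and the highest-MAPQ read is kept, following the description in this thread.

```python
from collections import namedtuple

# Hypothetical, simplified single-end alignment record.
# start/end are the leftmost/rightmost aligned reference positions (inclusive).
Read = namedtuple("Read", "name chrom start end reverse mapq")


def five_prime(read):
    """Reference coordinate of the read's 5' end (rightmost base if reversed)."""
    return read.end if read.reverse else read.start


def dedup_single_end(reads):
    """Keep one read per (chrom, strand, 5' position): the highest-MAPQ one."""
    best = {}
    for r in reads:
        key = (r.chrom, r.reverse, five_prime(r))
        if key not in best or r.mapq > best[key].mapq:
            best[key] = r
    return sorted(best.values(), key=lambda r: (r.chrom, r.start, r.name))


reads = [
    Read("r1", "chr1", 100, 149, False, 40),
    Read("r2", "chr1", 100, 139, False, 50),  # same 5' start, shorter 3' end
    Read("r3", "chr1", 60, 149, True, 30),    # reverse strand: 5' end at 149
    Read("r4", "chr1", 100, 149, True, 20),   # reverse strand: 5' end also 149
]
kept_reads = dedup_single_end(reads)
print([r.name for r in kept_reads])  # ['r3', 'r2']
```

Note that "r1" and "r2" count as duplicates even though their 3' ends differ, which is exactly the situation the 5'-only rule is meant to handle.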


Damian is right. You can also say that samtools rmdup looks for unique start positions of the reads. Because most aligners or users trim the 3' end of the reads before aligning, two reads that are duplicates may end up with different 3' coordinates, but they will still have the same 5' start position, so only the 5' positions are used. I am sure you already know this, but just in case.

12.1 years ago

The samtools manual that you link to has this:

rmdup: samtools rmdup [-sS] <input.srt.bam> <out.bam>

Remove potential PCR duplicates: if multiple read pairs have identical external coordinates, only retain the pair with highest mapping quality. In the paired-end mode, this command ONLY works with FR orientation and requires ISIZE is correctly set. It does not work for unpaired reads (e.g. two ends mapped to different chromosomes or orphan reads).

OPTIONS:
-s Remove duplicate for single-end reads. By default, the command works for paired-end reads only.
-S Treat paired-end reads and single-end reads.
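As a small illustration of how the quoted synopsis and options fit together, the Python helper below assembles the argument list for each mode. The helper name, the `mode` labels, and the file names are hypothetical; only the `-s` and `-S` flags and the argument order come from the manual text above.

```python
def rmdup_command(in_bam, out_bam, mode="paired"):
    """Build the argv list for `samtools rmdup [-sS] <input.srt.bam> <out.bam>`."""
    # Mode labels are made up for this sketch; the flags are from the manual.
    flags = {"paired": [], "single": ["-s"], "both": ["-S"]}
    if mode not in flags:
        raise ValueError(f"unknown mode: {mode!r}")
    return ["samtools", "rmdup", *flags[mode], in_bam, out_bam]


# Single-end library: placeholder file names, input assumed coordinate-sorted.
cmd = rmdup_command("my.srt.bam", "my.rmdup.bam", mode="single")
print(" ".join(cmd))  # samtools rmdup -s my.srt.bam my.rmdup.bam
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a machine where samtools is installed.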

Yup, I have read it carefully. I guess I am thick-headed and just wanted some clarification.
