Why do we remove duplicates from BAM files with Samtools? With paired-end data, we can remove duplicates as a fragment OR as a pair. How do these two methods differ?
I would personally recommend using Picard for marking or removing duplicates. If you have paired-end data, both reads of a pair are used to identify duplicates. In this case, if another pair has both of its reads aligning at exactly the same locations as this pair, one of the two pairs is marked as a duplicate. For fragment (single-end) reads, only the location of the one read is used to mark duplicates.
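To make the difference concrete, here is a minimal Python sketch of the comparison key used in each mode. The `Pair` record and the functions `fragment_key` and `pair_key` are made up for illustration and are not part of Picard or samtools; the real tools work on alignment coordinates taken from the BAM records.

```python
from collections import namedtuple

# Hypothetical, simplified record for one aligned read pair.
Pair = namedtuple("Pair", "name chrom pos1 strand1 pos2 strand2")

def fragment_key(p):
    # Fragment mode: only the position and strand of one read are compared.
    return (p.chrom, p.pos1, p.strand1)

def pair_key(p):
    # Pair mode: both reads must share position and strand.
    return (p.chrom, p.pos1, p.strand1, p.pos2, p.strand2)

a = Pair("a", "chr1", 100, "+", 350, "-")
b = Pair("b", "chr1", 100, "+", 350, "-")  # same as a at both ends
c = Pair("c", "chr1", 100, "+", 420, "-")  # same first read, different mate

print(fragment_key(a) == fragment_key(c))  # duplicates when treated as fragments
print(pair_key(a) == pair_key(c))          # not duplicates when treated as pairs
print(pair_key(a) == pair_key(b))          # duplicates in either mode
```

So `a` and `c` would collapse into one read in fragment mode, while pair mode keeps both because their mates align at different positions.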
Here is an example. Suppose that, for fragment data, 5 reads align at exactly the same location. 4 of them will be marked as duplicates and 1 will be kept for further use. The tool chooses which read to keep, so you don't need to worry about it: Picard by default keeps the read with the highest sum of base qualities, and samtools rmdup keeps the one with the highest mapping quality. Also, duplicate marking is done at the library level, so if you have two libraries, their duplicates are marked separately. Reads from two different libraries that align at the same position won't be marked as duplicates of each other.
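The 5-read example can be sketched in Python as a toy model. This is only an illustration of the idea, not how Picard or samtools is implemented: the function `mark_duplicates`, the dict keys, and the single `score` field (a stand-in for whatever quality measure the tool ranks by) are all assumptions.

```python
from collections import defaultdict

def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates (toy model).

    Each read is a dict with the hypothetical keys:
    name, library, chrom, pos, strand, score.
    """
    # Group reads by library first, so libraries never mix.
    groups = defaultdict(list)
    for r in reads:
        groups[(r["library"], r["chrom"], r["pos"], r["strand"])].append(r)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: r["score"])  # the read that is kept
        duplicates.update(r["name"] for r in group if r is not best)
    return duplicates

# Five reads from one library at the same location, with made-up scores.
reads = [
    {"name": f"r{i}", "library": "libA", "chrom": "chr1",
     "pos": 100, "strand": "+", "score": s}
    for i, s in enumerate([30, 42, 35, 28, 40])
]
# A read from a second library at the same position.
reads.append({"name": "other", "library": "libB", "chrom": "chr1",
              "pos": 100, "strand": "+", "score": 10})

dups = mark_duplicates(reads)
print(sorted(dups))
```

Four of the five libA reads end up in `dups`, the highest-scoring one (`r1`) is kept, and the libB read is untouched even though it sits at the same position.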
There is a thread on SEQanswers that may be of interest, "Samtools's rmdup vs. Picard's MarkDuplicates":
http://seqanswers.com/forums/showthread.php?t=5424