Question

How to Determine PCR Duplicate/Redundant Reads in NGS Data?

12.1 years ago

JacobS ▴ 1000

Hi,

I am looking for methods for identifying PCR duplicates/redundant reads in NGS data. So far I have come across the MarkDuplicates method in Picard and the rmdup method in SAMtools. Does anyone know of other software packages that perform this function?

Thanks!

EDIT: I will aggregate any software found to be able to mark PCR duplicates in a list:

SAMtools
Picard

pcr duplicates qc • 6.4k views

ADD COMMENT • link updated 12.1 years ago by Istvan Albert 103k • written 12.1 years ago by JacobS ▴ 1000

Just curious: why are you looking for other tools? Is there some feature or behavior you're looking for that Picard and SAMtools do not currently provide?

ADD REPLY • link 12.1 years ago by Dan D 7.4k

score 0 · Answer 1 · 2013-08-12

There are a few ways to go about it. There are tools that

look for exact matches via an associative array (hash, dictionary): for example the fastx_collapser in the fastx toolkit.
look for exact matches by sorting the sequences and removing consecutive identical sequences; for that you could use a combination of command-line tools such as sort and uniq
look for reads that align over the same region; for this to work the data needs to be aligned against a reference genome: samtools rmdup works this way
cluster the reads and merge reads that are very similar to one another using a tool like uclust
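
The first two approaches in the list (hash lookup and sort-then-compare) can be sketched in a few lines of Python. This is only an illustration on a small in-memory list of sequences. The function names `collapse_with_hash` and `collapse_with_sort` are made up for this example and are not part of any toolkit. Parsing FASTA/FASTQ files is left out.

```python
from collections import Counter
from itertools import groupby


def collapse_with_hash(reads):
    """Count identical sequences using a dictionary (hash table)."""
    return Counter(reads)


def collapse_with_sort(reads):
    """Sort the sequences, then count each run of consecutive identical ones."""
    return {seq: sum(1 for _ in run) for seq, run in groupby(sorted(reads))}


# Hypothetical example reads; real input would come from a FASTA/FASTQ file.
reads = ["ACGT", "TTGA", "ACGT", "GGCC", "ACGT", "TTGA"]

hash_counts = collapse_with_hash(reads)
sort_counts = collapse_with_sort(reads)

# Both approaches report the same unique sequences and copy numbers.
for seq, count in sorted(sort_counts.items()):
    print(seq, count)
```

Both functions only catch exactly identical sequences, so a single sequencing error makes two copies of the same fragment look distinct. The hash version keeps every unique sequence in memory, while the sort version mirrors what sort followed by uniq does on the command line.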

Ideally, duplicates are removed after alignment, but depending on the problem that may not be feasible.
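
To illustrate the alignment-based idea, here is a simplified sketch that groups aligned reads by where they map and flags all but one read per group. It is not how any particular tool is implemented. The `Alignment` record, the `mark_duplicates` function, and the rule of keeping the read with the highest mapping quality are assumptions made for this example. Real tools such as samtools rmdup and Picard MarkDuplicates also deal with paired-end reads and other details that are ignored here.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    """Hypothetical, minimal record for one aligned single-end read."""
    name: str        # read name, assumed unique
    chrom: str       # reference sequence name
    five_prime: int  # position of the read's 5' end on the reference
    strand: str      # "+" or "-"
    mapq: int        # mapping quality


def mark_duplicates(alignments):
    """Return the names of reads flagged as duplicates.

    Reads sharing (chrom, five_prime, strand) are treated as one group.
    The read with the highest mapq is kept (first one wins on ties, an
    assumption for this sketch) and the others are flagged.
    """
    best = {}
    for aln in alignments:
        key = (aln.chrom, aln.five_prime, aln.strand)
        if key not in best or aln.mapq > best[key].mapq:
            best[key] = aln
    kept = {aln.name for aln in best.values()}
    return [aln.name for aln in alignments if aln.name not in kept]


# Hypothetical alignments; real input would come from a sorted BAM file.
alignments = [
    Alignment("r1", "chr1", 100, "+", 30),
    Alignment("r2", "chr1", 100, "+", 60),
    Alignment("r3", "chr1", 100, "-", 60),
    Alignment("r4", "chr2", 100, "+", 20),
    Alignment("r5", "chr1", 100, "+", 60),
]

duplicates = mark_duplicates(alignments)
print(duplicates)
```

Because grouping is by mapped position rather than by sequence, reads that differ by a sequencing error but come from the same fragment still end up in the same group, which is the main advantage over exact-match collapsing.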

For more details search this site for "remove duplicates" to find good posts on various tools and techniques.