How to determine PCR duplicate/redundant reads in NGS data?
11.3 years ago
JacobS ▴ 990

Hi,

I am looking for various methods for determining PCR duplicates/redundant reads in NGS data, and so far have come across the "mark duplicates" method in Picard and the rmdup method in SAMtools. Does anyone know of other software packages that perform this function?

Thanks!

EDIT: I will aggregate any software found to be able to mark PCR duplicates in a list:

  • SAMTools
  • Picard
pcr duplicates qc • 6.2k views

Just curious: why are you looking for other tools? Is there some feature or behavior you're looking for that Picard and Samtools do not currently provide?

11.3 years ago

There are a few ways to go about it. There are tools that

  • look for exact matches via an associative array (hash, dictionary): for example, fastx_collapser in the FASTX-Toolkit.
  • look for exact matches by sorting the sequences and removing consecutive identical sequences: for that you could use a combination of command line tools such as sort and uniq.
  • look for reads that align to the same position: for this to work, the data first need to be aligned against a reference genome. samtools rmdup works this way.
  • cluster the reads and merge reads that are very similar to one another using a tool like uclust
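
To make the first two bullets concrete, here is a minimal Python sketch of both exact-match strategies. The function names (collapse_exact, collapse_sorted) and the toy reads are made up for illustration; this is not how fastx_collapser is implemented internally, just the same idea in a few lines.

```python
from collections import Counter
from itertools import groupby


def collapse_exact(reads):
    """Hash-based collapse: count identical sequences with a dictionary.

    Returns (sequence, count) pairs in first-seen order.
    """
    counts = Counter(reads)  # dict subclass; keeps insertion order on Python 3.7+
    return list(counts.items())


def collapse_sorted(reads):
    """Sort-based collapse: sort, then merge runs of identical neighbours.

    Roughly what `sort | uniq -c` does on one-sequence-per-line input.
    """
    return [(seq, sum(1 for _ in run)) for seq, run in groupby(sorted(reads))]


# Toy input, one sequence per read (hypothetical data).
reads = ["ACGT", "TTGA", "ACGT", "ACGA", "ACGT"]

by_hash = collapse_exact(reads)
by_sort = collapse_sorted(reads)
```

Both give the same counts; the hash version keeps the original read order, the sort version returns sequences alphabetically. Note that neither tolerates sequencing errors: a single mismatched base makes two reads count as different.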

Ideally, duplicates are removed after alignment, but depending on the problem that may not be feasible.
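
As a rough illustration of the post-alignment idea, the sketch below flags reads that share chromosome, start position and strand, keeping the one with the highest mapping quality. Everything here is a simplifying assumption: the Alignment record, the mark_duplicates function and the tie-break rule are hypothetical, read names are assumed unique, and real tools also take things like mate position and clipping into account.

```python
from collections import namedtuple

# Hypothetical, simplified alignment record (not a real SAM/BAM parser).
Alignment = namedtuple("Alignment", "name chrom pos strand mapq")


def mark_duplicates(alignments):
    """Return (read name, is_duplicate) pairs in input order.

    Reads with the same (chrom, pos, strand) form one group; the read with
    the highest mapq in each group is kept, with ties going to the first seen.
    """
    best = {}
    for aln in alignments:
        key = (aln.chrom, aln.pos, aln.strand)
        if key not in best or aln.mapq > best[key].mapq:
            best[key] = aln
    kept = {aln.name for aln in best.values()}
    return [(aln.name, aln.name not in kept) for aln in alignments]


# Toy alignments (hypothetical data).
alns = [
    Alignment("r1", "chr1", 100, "+", 30),
    Alignment("r2", "chr1", 100, "+", 60),
    Alignment("r3", "chr1", 100, "-", 20),
    Alignment("r4", "chr2", 100, "+", 40),
]

flags = mark_duplicates(alns)
```

Here r1 is flagged because r2 maps to the same place on the same strand with a higher mapping quality; r3 is on the opposite strand and r4 on another chromosome, so both are kept.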

For more details search this site for "remove duplicates" to find good posts on various tools and techniques.
