Question

Tools to remove duplicate or substring reads

0

8.6 years ago

chjiao3456 ▴ 40

Is there an efficient tool for removing duplicate reads or substring reads from an NGS data set? I know that Readjoiner can remove duplicate reads, but it does not seem to work on substring reads. Thanks.

Examples:

Duplicates:
read1: AGTCAT
read2: AGTCAT
In this case, only one read will be kept.

Substring:
read1: GTCA
read2: AGTCAT
In this case, read1 will be removed.
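
To make the two cases concrete, here is a minimal Python sketch of the behaviour described in the examples. It is only an illustration of the intended result, not how any of the tools mentioned in this thread work: the function name `remove_duplicates_and_substrings` is hypothetical, reads are assumed to be plain strings, reverse complements are not considered, and the all-against-all comparison is quadratic, so it is not suitable for a real NGS data set.

```python
def remove_duplicates_and_substrings(reads):
    """Drop exact duplicate reads and reads contained in another read.

    Naive illustration only: every read is compared against every kept
    longer read, so the running time is quadratic in the number of reads.
    Reverse complements are NOT treated as matches (an assumption).
    """
    # Remove exact duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(reads))

    # Visit the longest reads first: a read can only be a substring of a
    # read that is strictly longer, and containment is transitive, so
    # checking against the reads kept so far is sufficient.
    kept = []
    for read in sorted(unique, key=len, reverse=True):
        if not any(read in longer for longer in kept):
            kept.append(read)

    # Report the surviving reads in their original input order.
    kept_set = set(kept)
    return [read for read in unique if read in kept_set]


if __name__ == "__main__":
    # Duplicate case from the question: only one copy is kept.
    print(remove_duplicates_and_substrings(["AGTCAT", "AGTCAT"]))
    # Substring case from the question: GTCA is removed.
    print(remove_duplicates_and_substrings(["GTCA", "AGTCAT"]))
```

With the two examples above, the first call keeps a single AGTCAT and the second call keeps only AGTCAT, because GTCA occurs inside it.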

next-gen sequencing alignment • 1.8k views

score 0 · Answer 1 · 2016-11-29

0

8.6 years ago

Brian Bushnell 20k

The most efficient tool for this purpose is Dedupe from the BBMap package. However, it holds all reads in memory, so it needs a lot of memory for large data sets. Can you explain in more detail what you are trying to do?

0

Thanks for your help. I have added examples to the question.

Reply · 8.6 years ago by chjiao3456 ▴ 40

score 0 · Answer 2 · 2016-11-30

0

8.6 years ago

chjiao3456 ▴ 40

I just noticed that the SGA tool is able to do this. See "Collapse Reads That Are Substrings Of Other Reads In Same Library".
