Question

Duplicate Removal Of New Sequence Data?

0

12.0 years ago

Gabriel R. ★ 2.9k

I have reads from a eukaryotic genome, and there are duplicates due to sequencing. In a traditional environment, I would align them, then mark and remove duplicates, but here I have no reference.

I am wondering: is there any software that does duplicate removal on raw sequence data? What is your experience with it?

Sorry in advance if the question is naive.

duplicates assembly • 3.3k views

ADD COMMENT • link updated 11.7 years ago by Wrf ▴ 210 • written 12.0 years ago by Gabriel R. ★ 2.9k

0

The FASTX collapser (fastx_collapser) works on raw read data: http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_collapser_usage However, I am not sure which kind of duplicates you want to remove. Completely identical reads?
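To make "completely identical reads" concrete, here is a minimal Python sketch of exact-duplicate collapsing. It is not how fastx_collapser is implemented; the function names and the tiny inline FASTQ are made up for illustration, and it assumes plain 4-line FASTQ records.

```python
from collections import Counter

def read_fastq_sequences(lines):
    """Yield the sequence line of each 4-line FASTQ record (assumed layout)."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()

def collapse_exact(sequences):
    """Return (sequence, count) pairs for identical reads, most abundant first."""
    return Counter(sequences).most_common()

# Hypothetical toy input: r1 and r2 are exact duplicates.
fastq = """@r1
ACGT
+
IIII
@r2
ACGT
+
IIII
@r3
TTGA
+
IIII
""".splitlines()

collapsed = collapse_exact(read_fastq_sequences(fastq))
for seq, n in collapsed:
    print(seq, n)
```

Note that this keeps every distinct sequence in memory, so memory use grows with the number of unique reads.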

ADD REPLY • link 12.0 years ago by Biomonika (Noolean) 3.2k

0

Hmm, maybe tolerate 1 mismatch, but a collapsed read would then have to be a consensus of the two (or more). How does this tool scale to, say, a full HiSeq run or more?
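A rough Python sketch of what "tolerate 1 mismatch and emit a consensus" could look like. The greedy seed-based clustering and the majority-vote consensus are assumptions chosen for illustration, not the behaviour of any tool named in this thread, and the all-against-seeds comparison would not scale to a full lane as written.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatching positions between two equal-length reads."""
    return sum(x != y for x, y in zip(a, b))

def cluster_reads(reads, max_mismatches=1):
    """Greedily assign each read to the first cluster whose seed is close enough."""
    clusters = []  # each cluster is a list of reads; element 0 is the seed
    for read in reads:
        for cluster in clusters:
            seed = cluster[0]
            if len(seed) == len(read) and hamming(seed, read) <= max_mismatches:
                cluster.append(read)
                break
        else:
            clusters.append([read])
    return clusters

def consensus(cluster):
    """Per-column majority base; ties go to the base seen first in the column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

# Hypothetical toy reads: the third differs from the first two at one position.
reads = ["ACGTAC", "ACGTAC", "ACGTAT", "GGGTTT"]
collapsed = [consensus(c) for c in cluster_reads(reads)]
print(collapsed)
```

Each read is compared against every existing seed, so the cost grows with reads times clusters; a real implementation would need hashing or indexing to avoid that.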

ADD REPLY • link 12.0 years ago by Gabriel R. ★ 2.9k

0

How are you going to process the data? If you assemble the reads, most assemblers will take care of the duplicates.

ADD REPLY • link 12.0 years ago by lh3 33k

0

Hi Heng, we are using SOAPdenovo. The Panda Genome guys claim: "The redundant reads were filtered at a threshold of euclid distance <= 3 and a mismatch rate of <= 0.1. We observed that the average rate of base-calling duplicates for each lane was about 0.83%, ranging from 0.00% to 8.52%." But they did that using an in-house pipeline, so I was wondering whether that is a procedure one should use.
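The quote does not say what the Euclidean distance is measured between. One possible reading, and it is only an assumption here, is that it refers to the cluster coordinates on the flowcell, so that two nearby clusters with nearly identical sequence count as one base-calling duplicate. Under that assumption, the two thresholds could be combined like this in Python (the tuple layout and function names are hypothetical):

```python
import math

def mismatch_rate(a, b):
    """Fraction of mismatching positions over the shared length of two reads."""
    n = min(len(a), len(b))
    if n == 0:
        return 1.0
    return sum(x != y for x, y in zip(a, b)) / n

def is_redundant(read_a, read_b, max_dist=3.0, max_mismatch_rate=0.1):
    """Apply both quoted thresholds; reads are (x, y, sequence) tuples (assumed)."""
    (xa, ya, seq_a), (xb, yb, seq_b) = read_a, read_b
    close = math.hypot(xa - xb, ya - yb) <= max_dist
    similar = mismatch_rate(seq_a, seq_b) <= max_mismatch_rate
    return close and similar

# Hypothetical toy reads: a and b are neighbours with one mismatch in ten bases;
# c has the same sequence as a but sits far away on the flowcell.
a = (100, 200, "ACGTACGTAC")
b = (101, 202, "ACGTACGTAT")
c = (500, 900, "ACGTACGTAC")
print(is_redundant(a, b), is_redundant(a, c))
```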

ADD REPLY • link 12.0 years ago by Gabriel R. ★ 2.9k

0

I do not know what that threshold is used for, probably for SNP calling. For de novo assembly, it does not matter too much whether the duplicate rate is high.

ADD REPLY • link 12.0 years ago by lh3 33k

score 2 · Answer 1 · 2012-12-10

2

12.0 years ago

Raygozak ★ 1.4k

I also recommend PRINSEQ lite. It is a very nice tool that generates statistics about your reads and has a mode to filter duplicates using four different criteria. It can also trim bases and remove reads with less than a given mean quality. It has more options, of course.
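One kind of duplicate beyond exact copies is the reverse-complement duplicate, where the same fragment was read from the opposite strand. A minimal Python sketch of that idea follows; it is not PRINSEQ's code, and the function names and the canonical-key trick are illustrative assumptions.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def dedup_with_revcomp(reads):
    """Keep the first read of each group that matches exactly or as reverse complement."""
    seen = set()
    kept = []
    for seq in reads:
        # A read and its reverse complement share the same canonical key.
        key = min(seq, reverse_complement(seq))
        if key not in seen:
            seen.add(key)
            kept.append(seq)
    return kept

# Hypothetical toy reads: "CGTT" is the reverse complement of "AACG".
reads = ["AACG", "CGTT", "AACG", "GGGA"]
kept = dedup_with_revcomp(reads)
print(kept)
```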

ADD COMMENT • link 12.0 years ago by Raygozak ★ 1.4k

score 0 · Answer 2 · 2013-03-05

0

11.7 years ago

Wrf ▴ 210

I am not quite sure why you would want to remove duplicates from the raw data, but you could try "sequniq" in the GenomeTools package. I have never tried it on millions of raw reads, but for contigs it's quite fast.

ADD COMMENT • link 11.7 years ago by Wrf ▴ 210