Duplicate Removal Of New Sequence Data?
2
12.0 years ago
Gabriel R. ★ 2.9k

I have reads from a eukaryotic genome, and some of them are duplicates introduced by sequencing. In a traditional setting I would align them to a reference, then mark and remove duplicates, but here I have no reference.

I am wondering: is there any software that removes duplicates from raw sequence data? What is your experience with it?

Sorry in advance if the question is naive.

duplicates assembly • 3.3k views
ADD COMMENT
0

fastx_collapser from the FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_collapser_usage) works on raw read data. However, I am not sure which kind of duplicates you want to remove. Completely identical reads?
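If it is exact duplicates you mean, the idea is simply to keep one representative per distinct sequence and record how many copies it had. A minimal Python sketch of that concept (an illustration only, not the FASTX-Toolkit code; the file names are made up):

```python
from collections import Counter

def collapse_identical_reads(fastq_path):
    """Count exact-duplicate reads by sequence."""
    counts = Counter()
    with open(fastq_path) as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:  # in FASTQ, the 2nd line of every 4-line record is the sequence
                counts[line.strip()] += 1
    return counts

# Write one representative per unique sequence, with its copy number in the header
counts = collapse_identical_reads("reads.fastq")          # hypothetical input file
with open("reads.collapsed.fasta", "w") as out:
    for n, (seq, copies) in enumerate(sorted(counts.items(), key=lambda kv: -kv[1]), 1):
        out.write(f">{n}-{copies}\n{seq}\n")
```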

ADD REPLY
0

Hmm, maybe tolerate 1 mismatch, but a collapsed read would then have to be a consensus of the two (or more) reads. How does this tool scale to, say, a full HiSeq run or more?
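To make the 1-mismatch idea concrete, here is a rough Python sketch of one possible interpretation: greedily cluster reads that are within one mismatch of a cluster representative and emit a per-position majority consensus. This naive pairwise comparison would be far too slow for a full HiSeq run; a real tool would need to index reads by prefix or k-mer first.

```python
from collections import Counter

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def consensus(reads):
    """Per-position majority base over equal-length reads."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))

def collapse_near_duplicates(reads, max_mismatch=1):
    """Greedy clustering: assign each read to the first cluster within max_mismatch."""
    clusters = []                      # each cluster is a list of member reads
    for r in reads:
        for members in clusters:
            if len(r) == len(members[0]) and hamming(r, members[0]) <= max_mismatch:
                members.append(r)
                break
        else:
            clusters.append([r])
    return [consensus(members) for members in clusters]

print(collapse_near_duplicates(["ACGTACGT", "ACGTACGA", "TTTTCCCC"]))
# -> ['ACGTACGT', 'TTTTCCCC']  (the first two reads collapse to one consensus)
```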

ADD REPLY
0

How are you going to process the data? If you assemble the reads, most assemblers will take care of the duplicates.

ADD REPLY
0

Hi Heng, we are using SOAPdenovo. The panda genome authors claim: "The redundant reads were filtered at a threshold of Euclidean distance <= 3 and a mismatch rate of <= 0.1. We observed that the average rate of base-calling duplicates for each lane was about 0.83%, ranging from 0.00% to 8.52%." But they did that with an in-house pipeline, so I was wondering whether that is a procedure one should use.
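For what it is worth, the mismatch-rate half of that filter is easy to state. A small Python sketch of how I read it (my interpretation only; the quote does not say what the Euclidean distance is computed over, so that criterion is left out):

```python
def mismatch_rate(read_a, read_b):
    """Fraction of positions that differ between two equal-length reads."""
    assert len(read_a) == len(read_b)
    return sum(a != b for a, b in zip(read_a, read_b)) / len(read_a)

def looks_redundant(read_a, read_b, max_rate=0.1):
    # The paper also applies a Euclidean-distance cutoff (<= 3); since the quote does not
    # define it, only the mismatch-rate criterion is shown here.
    return mismatch_rate(read_a, read_b) <= max_rate

print(looks_redundant("ACGTACGTAC", "ACGTACGTAA"))  # 1 mismatch in 10 bases -> True
```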

ADD REPLY
0

I do not know what that threshold is used for, probably for SNP calling. For de novo assembly, it does not matter too much whether the duplicate rate is high.

ADD REPLY
2
12.0 years ago
Raygozak ★ 1.4k

I also recommend PRINSEQ-lite. It is a very nice tool that generates statistics about your reads, has a mode to filter duplicates using four different criteria, and can trim bases and remove reads below a given mean quality. It has more options, of course.
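As a conceptual illustration only (this is not PRINSEQ's code or command-line interface), exact dereplication combined with a mean-quality cutoff amounts to something like the following Python sketch:

```python
def mean_quality(qual_string, offset=33):
    """Mean Phred quality of a read from its FASTQ quality string (Phred+33 assumed)."""
    return sum(ord(c) - offset for c in qual_string) / len(qual_string)

def dereplicate(records, min_mean_q=20):
    """Keep one copy per exact sequence, dropping low-quality reads first.

    records: iterable of (name, sequence, quality_string) tuples.
    This mimics only two of the filters (exact dereplication, mean-quality cutoff);
    PRINSEQ itself offers several duplicate definitions and many more options.
    """
    seen = set()
    for name, seq, qual in records:
        if mean_quality(qual) < min_mean_q:
            continue                 # quality filter
        if seq in seen:
            continue                 # exact duplicate
        seen.add(seq)
        yield name, seq, qual

reads = [("r1", "ACGT", "IIII"), ("r2", "ACGT", "IIII"), ("r3", "TTTT", "!!!!")]
print(list(dereplicate(reads)))      # only r1 survives: r2 is a duplicate, r3 is low quality
```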

ADD COMMENT
0
11.7 years ago
Wrf ▴ 210

I am not quite sure why you would want to remove duplicates from the raw data, but you could try "sequniq" in the GenomeTools package. I have never tried it on millions of raw reads, but for contigs it is quite fast.

ADD COMMENT
