I have reads from a eukaryotic genome, and there are duplicates due to sequencing. In a traditional environment, I would align them, then mark and remove duplicates, but here I have no reference.
I am wondering, is there any software that removes duplicates from raw sequence data? What is your experience with it?
Sorry in advance if the question is naive.
fastx_collapser (from the FASTX-Toolkit) works on raw read data: http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_collapser_usage I am, however, not sure which kind of duplicates you want to remove. Completely identical reads?
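To make "completely identical reads" concrete, here is a minimal Python sketch of exact-duplicate collapsing. It is only an illustration of the idea, not fastx_collapser's actual implementation; the helper names (`read_fastq_sequences`, `collapse_exact`), the toy reads, and the printed header format are assumptions made for this example.

```python
from collections import Counter


def read_fastq_sequences(lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:  # line 2 of every record holds the bases
            yield line.strip()


def collapse_exact(sequences):
    """Return (sequence, count) pairs, most abundant first."""
    return Counter(sequences).most_common()


# Toy input (hypothetical reads): r1 and r2 are exact duplicates.
fastq = """@r1
ACGTACGT
+
IIIIIIII
@r2
ACGTACGT
+
IIIIIIII
@r3
TTTTACGT
+
IIIIIIII
""".splitlines()

collapsed = collapse_exact(read_fastq_sequences(fastq))

# Print one FASTA-like record per unique sequence (illustrative header: rank-count).
for rank, (seq, count) in enumerate(collapsed, start=1):
    print(f">{rank}-{count}\n{seq}")
```

Note that this keeps every unique sequence in memory and discards the quality strings, so it only addresses reads that match base for base.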
Hmm, maybe tolerate 1 mismatch, but then a collapsed read would have to be a consensus of the two (or more) reads. How does this tool scale to, say, a full HiSeq run or more?
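What "tolerate 1 mismatch and keep a consensus" could look like is sketched below in Python. This is a rough illustration under several assumptions, not the behaviour of any existing tool: reads are only compared when they have equal length, the consensus simply keeps the base with the higher quality character at each position, and clustering is greedy in input order. The function names and toy reads are hypothetical. To avoid comparing every read against every other, it relies on the fact that two reads differing at no more than one position must be identical in at least one of their two halves.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def consensus(read_a, read_b):
    """Per position, keep the base (and quality) with the higher quality char."""
    bases, quals = [], []
    for ba, qa, bb, qb in zip(read_a[0], read_a[1], read_b[0], read_b[1]):
        if qa >= qb:
            bases.append(ba)
            quals.append(qa)
        else:
            bases.append(bb)
            quals.append(qb)
    return "".join(bases), "".join(quals)


def half_keys(seq):
    """Index keys: reads within 1 mismatch share at least one identical half."""
    mid = len(seq) // 2
    return [(0, len(seq), seq[:mid]), (1, len(seq), seq[mid:])]


def collapse_one_mismatch(reads):
    """Greedily collapse (seq, qual) reads that differ at <= 1 position."""
    reps = []    # each entry: [seq, qual, count]
    index = {}   # half key -> indices into reps
    for seq, qual in reads:
        match = None
        for key in half_keys(seq):
            for idx in index.get(key, ()):
                if hamming(seq, reps[idx][0]) <= 1:
                    match = idx
                    break
            if match is not None:
                break
        if match is None:
            reps.append([seq, qual, 1])
            match = len(reps) - 1
        else:
            rep = reps[match]
            rep[0], rep[1] = consensus((rep[0], rep[1]), (seq, qual))
            rep[2] += 1
        # (Re)register the representative under the halves of its current sequence.
        for key in half_keys(reps[match][0]):
            bucket = index.setdefault(key, [])
            if match not in bucket:
                bucket.append(match)
    return [tuple(rep) for rep in reps]


# Toy reads (hypothetical): the second differs from the first at one low-quality base.
reads = [
    ("ACGTACGT", "IIIIIIII"),
    ("ACGTACGA", "IIIIIII#"),
    ("TTTTTTTT", "IIIIIIII"),
]
result = collapse_one_mismatch(reads)
print(result)
```

Because every representative and its index stay in memory, a pure-Python version like this is unlikely to be practical for a full HiSeq run; it is meant to show the logic rather than to answer the scaling question.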
How are you going to process the data? If you assemble the reads, most assemblers will take care of the duplicates.
Hi Heng, we are using SOAPdenovo. The Panda Genome guys claim: "The redundant reads were filtered at a threshold of euclid distance <= 3 and a mismatch rate of <= 0.1. We observed that the average rate of base-calling duplicates for each lane was about 0.83%, ranging from 0.00% to 8.52%." But they did that using an in-house pipeline, so I was wondering whether that is a procedure one should use.
I do not know what that threshold is used for, probably for SNP calling. For de novo assembly, it does not matter too much whether the duplicate rate is high.