6.1 years ago
florian.noack
Hi, I am currently working with some ChIP-seq data generated from a very low number of cells. The data look good so far, but I noticed that some loci got heavily amplified during library preparation, which I guess is a consequence of working with low amounts of material. I am now looking for a tool to cap the number of identical reads per locus at, for example, 3 (e.g. if I have 10 identical reads, 7 will be removed and 3 remain). As far as I have read, both Picard tools and samtools remove duplicates in an all-or-nothing manner. Does somebody have a handy solution for me (I am a biologist :p)?
Thanks, Flo
I am not immediately aware of such a tool. What is special about the requirement of leaving three instead of just one? You could use clumpify.sh (Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates.), which has an option to add a count field to the fastq header after deduplicating the data; you could use that to keep track of how many duplicates there were originally.
Because it's ChIP-seq, I would expect to have some duplicates simply because we drastically reduce genomic complexity, especially when using just a few cells (additional loss of complexity simply from losing some DNA fragments after shearing). I am not sure which exact number I will allow later; it is just to play around a bit, but removing all of them is maybe too harsh in my case.
prinseq can remove duplicated sequences. If you have high levels of read duplication you may consider removing them; if not, I think that using arbitrary filters may badly bias the analysis.
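If you just want to play around with different thresholds, the capping described in the question (keep at most N identical reads per locus) can be sketched in a few lines of Python. This is only a rough sketch, not an existing tool: the function name `cap_duplicates` and the parameter `max_copies` are made up for illustration. It assumes single-end reads in SAM text format and treats reads as "identical" when they share reference name, leftmost mapping position and strand. It ignores soft-clipping and paired-end mates, which dedicated duplicate markers do account for.

```python
from collections import Counter


def cap_duplicates(sam_lines, max_copies=3):
    """Yield SAM lines, keeping at most `max_copies` reads per locus.

    A locus is (reference name, leftmost position, strand).
    Simplifying assumption: single-end reads, no soft-clip correction.
    """
    seen = Counter()
    for line in sam_lines:
        # Header lines start with '@' and are passed through unchanged.
        if line.startswith("@"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        # FLAG bit 0x4 marks an unmapped read: it has no locus, so keep it.
        if flag & 0x4:
            yield line
            continue
        # FLAG bit 0x10 marks a read mapped to the reverse strand.
        key = (fields[2], int(fields[3]), bool(flag & 0x10))
        seen[key] += 1
        if seen[key] <= max_copies:
            yield line


# Toy example: 10 identical forward-strand reads at chr1:100.
reads = [
    f"read{i}\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n" for i in range(10)
]
kept = list(cap_duplicates(reads, max_copies=3))
# 3 of the 10 reads are kept, 7 are dropped.
```

The first `max_copies` reads seen at each locus are the ones kept, so the result depends on the order of the input. Changing `max_copies` lets you compare, say, 1, 3 and 5 copies before deciding on a threshold.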