Question

Remove specific number of identical reads from fastq or bam files

6.6 years ago

florian.noack ▴ 20

Hi, I am currently working with ChIP-seq data generated from a very low number of cells. The data look good so far, but I noticed that some loci were heavily amplified during library preparation, which I guess is a consequence of working with low amounts of material. I am now looking for a tool to cap the number of identical reads per locus at, for example, 3 (e.g. if I have 10 identical reads, 7 will be removed and 3 remain). As far as I have read, both Picard tools and samtools remove duplicates in an all-or-nothing manner. Does somebody have a handy solution for me (I am a biologist :p)?

Thanks, Flo

ChIP-Seq duplicates • 1.3k views

ADD COMMENT • link 6.6 years ago by florian.noack ▴ 20

I am not immediately aware of such a tool. What is special about the requirement of leaving three instead of just one? You could use clumpify.sh (see "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates."), which has an option to add a count field to the fastq header after deduplicating the data. You could use that field to keep track of how many duplicates were there originally.

ADD REPLY • link 6.6 years ago by GenoMax 151k

Because it's ChIP-seq, I would expect to have some duplicates simply because we drastically reduce genomic complexity, especially when using just a few cells (with an additional loss of complexity from losing some DNA fragments after shearing). I am not sure which exact number I will allow later; it's just to play around a bit, but removing all of them is maybe too harsh in my case.
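To make the idea of a tunable cap concrete, here is a minimal Python sketch that works on plain-text SAM records. It is not an existing tool: the function name `cap_duplicates` and the `max_copies` parameter are made up for illustration. It assumes single-end reads and treats two reads as duplicates when they share chromosome, leftmost mapping position and strand, which is simpler than what Picard does (Picard compares unclipped 5' ends).

```python
from collections import Counter


def cap_duplicates(sam_lines, max_copies=3):
    """Yield SAM lines, keeping at most `max_copies` reads per
    (chromosome, leftmost position, strand).

    Hypothetical helper for illustration; assumes single-end reads.
    """
    seen = Counter()
    for line in sam_lines:
        # Header lines are passed through unchanged.
        if line.startswith("@"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        flag, chrom, pos = int(fields[1]), fields[2], int(fields[3])
        # Unmapped reads (flag bit 0x4) have no locus, so keep them all.
        if flag & 0x4:
            yield line
            continue
        # Flag bit 0x10 marks reads aligned to the reverse strand.
        key = (chrom, pos, bool(flag & 0x10))
        seen[key] += 1
        if seen[key] <= max_copies:
            yield line
```

The function takes any iterable of SAM lines, so it could be fed from the output of `samtools view -h` and the result converted back to BAM. Which of the copies survive is arbitrary here (simply the first ones encountered), and changing `max_copies` lets you try different limits.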

ADD REPLY • link 6.6 years ago by florian.noack ▴ 20

prinseq can remove duplicated sequences. If you have high levels of read duplication you may consider removing them; if not, I think that using arbitrary filters may badly bias the analysis.
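For duplicates defined purely by sequence, as in a FASTQ file before mapping, a capped variant could be sketched as follows. This is not how prinseq works internally; `cap_identical_reads` and `max_copies` are hypothetical names, and the sketch assumes standard four-line FASTQ records and enough memory to hold every distinct sequence.

```python
from collections import Counter


def cap_identical_reads(fastq_lines, max_copies=3):
    """Yield FASTQ lines, keeping at most `max_copies` reads per exact sequence.

    Hypothetical helper for illustration; assumes 4-line FASTQ records.
    """
    seen = Counter()
    it = iter(fastq_lines)
    # Pulling from the same iterator four times groups lines into records.
    for header, seq, plus, qual in zip(it, it, it, it):
        key = seq.strip()
        seen[key] += 1
        if seen[key] <= max_copies:
            yield from (header, seq, plus, qual)
```

Note that exact sequence identity is a stricter and different criterion than "same mapping position", so the two approaches will not remove the same reads.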

ADD REPLY • link 6.6 years ago by Buffo ★ 2.4k