Hello,
I have some sequencing data (ChIP-seq, single-end, 50-75 bp) with at times high duplication rates. Since the duplicates will not be used for peak calling downstream, I wonder whether it would be better to remove them before mapping, saving some disk space and cluster time.
To filter out duplicates I want to use fastp with the default dup_calc_accuracy level (3).
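Concretely, something like this is what I have in mind (a minimal sketch; file names are placeholders, and --dedup needs a reasonably recent fastp, I believe >= 0.22):

```bash
# Deduplicate single-end reads at the FASTQ level;
# --dup_calc_accuracy 3 is the default once --dedup is set
fastp \
    --in1 sample_R1.fastq.gz \
    --out1 sample_R1.dedup.fastq.gz \
    --dedup \
    --dup_calc_accuracy 3 \
    --json sample.fastp.json \
    --html sample.fastp.html
```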
My questions:
- Is such deduplication at the FASTQ level a sensible step in ChIP-seq?
- Are there any noticeable benefits to using higher dup_calc_accuracy levels?
Many thanks for your help / opinions
DK
You could also use clumpify.sh from the BBMap suite and keep the count of deduplicated reads in the FASTQ headers; see the sketch after the links below. Do you have such excessive duplication that you are worried about this?

Related threads:
- If we use MACS2 do we need to remove duplicate sequences with samtools rmdup?
- The Duplicates Dilemma of ChIP-Seq
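A minimal sketch of that clumpify.sh call (file names are placeholders; subs=0 is an assumption on my part to restrict it to exact duplicates, since the default tolerates a couple of substitutions):

```bash
# Collapse duplicate reads and append the copy count to each read name
clumpify.sh \
    in=sample_R1.fastq.gz \
    out=sample_R1.dedup.fastq.gz \
    dedupe=t \
    subs=0 \
    addcount=t
```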
I have seen over 40% at times. So if these are failed samples, I want them to fail early.
Also, I have 2-3 runs per sample, so carrying heavily duplicated runs through the merge, be it at the FASTQ or the BAM stage, only to discard the duplicates afterwards looks counterintuitive. I will run samtools markdup on the merged BAMs anyway (sketch below).
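For reference, a sketch of that merged-BAM step, assuming coordinate-sorted per-run BAMs; since the data are single-end I am skipping the fixmate step the samtools docs describe for paired-end input:

```bash
# Merge per-run BAMs (coordinate order is preserved),
# then mark and remove (-r) duplicates, printing stats (-s)
samtools merge sample.merged.bam run1.bam run2.bam run3.bam
samtools markdup -r -s sample.merged.bam sample.merged.dedup.bam
samtools index sample.merged.dedup.bam
```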
40% by no means indicates a failed sample. The ChIP-ed protein may not be very abundant, or the antibody may just not be very good; it can still be a valid sample. ChIP is tricky, so definitely look at the data in IGV before making a decision.
The high duplication rates come from patients' histopathology paraffin (FFPE) blocks: little starting material plus histone marks. I try to use what is left (no duplicates, primary alignments, MAPQ >= 15; filtering sketch below), but MACS often throws in the towel, and rightly so, since the deepTools fingerprint plots look scary. SICER/epic2 still call a bunch of peaks; their quality is a topic for another thread.
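A sketch of that filtering, assuming a duplicate-marked, coordinate-sorted BAM; flag 0x500 drops secondary (0x100) and duplicate (0x400) alignments. The genome build and input-control file names are placeholders:

```bash
# Keep primary, non-duplicate alignments with MAPQ >= 15
samtools view -b -q 15 -F 0x500 sample.merged.dedup.bam > sample.filtered.bam
samtools index sample.filtered.bam

# Broad histone marks: epic2 instead of MACS narrow peak calling
epic2 --treatment sample.filtered.bam \
      --control input.filtered.bam \
      --genome hg38 > sample_epic2_peaks.txt
```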
Re clumpify.sh: I will run tests to make sure that fastp and clumpify do more or less the same thing. While I am a fan of clumpify, I am not sure it can produce an easy-to-parse run record the way fastp does.
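For the record, fastp's JSON report is easy to parse; a quick sketch with jq, assuming the report was written with --json as above (the field names are from fastp's JSON layout as I recall it, so verify against your own report):

```bash
# Estimated duplication rate from the fastp JSON report
jq '.duplication.rate' sample.fastp.json

# Read counts before/after filtering, for a quick sanity check
jq '.summary.before_filtering.total_reads, .summary.after_filtering.total_reads' sample.fastp.json
```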