I am doing human transcription factor (TF) ChIP-seq data analysis in a particular cell line. I had previously analysed ChIP-seq data for the same TF and obtained around 3.7k significant peaks. My data are paired-end (150 bp mate lengths). I aligned the reads with bowtie2, sorted the BAM file by coordinate, and used MACS2 for peak calling. I did not remove duplicates separately and relied on the MACS2 default of keeping one tag per position. MACS2 was run with format BAMPE against a suitable control sample.
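For reference, the pipeline was essentially the following (a minimal sketch; the genome index, file names, output names and thread counts are placeholders):

    # Paired-end alignment with bowtie2, then coordinate sorting
    bowtie2 -x hg38_index -p 8 -1 chip_R1.fastq.gz -2 chip_R2.fastq.gz \
        | samtools sort -@ 8 -o chip.sorted.bam -
    samtools index chip.sorted.bam

    # MACS2 peak calling in paired-end mode against the control,
    # relying on the default --keep-dup 1 behaviour
    macs2 callpeak -t chip.sorted.bam -c input.sorted.bam -f BAMPE \
        -g hs -n chip_rep1 --outdir macs2_rep1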
Recently I did a replicate of the ChIP-seq (same TF), followed the same pipeline, and obtained more than 30k significant peaks, compared to the previous 3.7k. When visualizing the peaks against the alignment bigWig track, most of them were not prominent at all and looked more like background noise. I then noticed that, of the ~30k peaks, almost 29.5k had pileup values below 10. Checking the run statistics, I saw that the treatment redundancy rate was 0.78 (almost 80% of the reads were removed as duplicates), whereas in the previous experiment it was 0.2.
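The redundancy rate is taken from the MACS2 run log; the low-pileup peaks were counted from the MACS2 _peaks.xls output (a minimal sketch assuming the standard column layout, where pileup is column 6; the file name is a placeholder):

    # Count peaks whose pileup (column 6 of the MACS2 _peaks.xls output) is below 10;
    # requiring a numeric start coordinate in column 2 skips the comment and header lines
    awk '!/^#/ && $2 ~ /^[0-9]+$/ && $6 < 10' chip_rep2_peaks.xls | wc -l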
I have the following questions:
(1) Is there a problem with the library preparation step, given that I am getting so many duplicates, or am I doing something wrong in my analysis?
(2) If duplication is indeed the problem, what redundancy rate should I consider the threshold for reliable peak calling? I have ChIP-seq data for another TF under two conditions, and there I am seeing redundancy rates of 0.5 and 0.6.
ChIP-seq is super noisy; it can well be that data quality differs widely between experimental replicates. I see that in our own and in published data all the time. Lots of duplicates might indicate poor IP efficiency. I personally prefer to use dedicated software such as samblaster to remove duplicates before peak calling rather than leaving that to MACS2. Comparing raw peak numbers is problematic, as differences in noise can easily create a few more or fewer peaks. You now have to decide how to go on: either use a strict intersect, keeping only peaks present in both samples; use IDR and let that statistical framework figure out which peaks are "consistent"; or simply use the dataset with the better signal-to-noise ratio. I use IDR most of the time, though there is no fixed rule. Calling against input is a good idea if the input sample has sufficient coverage; I often see inputs that are dramatically undersequenced, in which case they are of essentially little to no use.
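A minimal sketch of what I mean, with placeholder file names: samblaster expects the name-grouped SAM stream straight from the aligner, so it goes between bowtie2 and the coordinate sort; for the consistency step, IDR is typically run on relaxed, p-value-ranked peak calls rather than the final stringent set.

    # Remove duplicates on the fly, then coordinate-sort
    bowtie2 -x hg38_index -p 8 -1 chip_R1.fastq.gz -2 chip_R2.fastq.gz \
        | samblaster --removeDups \
        | samtools sort -@ 8 -o chip.dedup.sorted.bam -

    # Option 1: strict intersect, keep only rep1 peaks that overlap a rep2 peak
    bedtools intersect -u -a rep1_peaks.narrowPeak -b rep2_peaks.narrowPeak > consistent_peaks.bed

    # Option 2: IDR on the two replicate peak sets
    idr --samples rep1_peaks.narrowPeak rep2_peaks.narrowPeak \
        --input-file-type narrowPeak --rank p.value \
        --output-file rep1_vs_rep2_idr.txt --plot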
Hi,
Are you calling the peaks against input or mock?
If you are calling against input, you need to check the input coverage.
Try removing the duplicates beforehand (you can use Picard) and see how many reads are lost in the ChIP and input data.
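A minimal sketch of that Picard step (placeholder file names; run it on both the ChIP and the input BAM and compare the PERCENT_DUPLICATION values in the metrics files):

    # Remove (rather than just flag) duplicates from the coordinate-sorted ChIP BAM
    picard MarkDuplicates \
        I=chip.sorted.bam \
        O=chip.dedup.bam \
        M=chip.dup_metrics.txt \
        REMOVE_DUPLICATES=true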
I am calling the peaks against input. The total number of reads for the treatment sample is 30+ million, and for the input it is around 20 million.
In the input, my redundancy rate is only 0.06, so I am hardly losing any reads there.
Then it might be due to the ChIP experiment; the immunoprecipitation may have been less efficient this time. You can use the approach suggested by ATpoint.