Hi, we are studying the binding sites of a TF that stays outside the nucleus until it is activated. One hour after hormone treatment, the TF is in the nucleus and can bind DNA. We did ChIP-seq both before and after hormone treatment, and our input comes from the hormone-treated cells. When I call peaks with MACS2, I get far more peaks in the untreated cells (40,000, compared with only 500 in the treated cells), which doesn't make sense because the TF is not supposed to bind DNA in untreated cells. In addition, we know the factor's motif, and only 4% of the peaks in the untreated cells contain it, whereas 70% of the 500 peaks from the treated cells do. My guess is that the untreated peaks are phantom peaks or noise, but why does MACS2 still call them, and how can I remove this kind of peak, or otherwise "clean" my peak set, so that I can trust the peaks I get from the treated cells?
If you have any other suggestions as to why I'm getting these odd results from the untreated cells, I would like to hear them too. I should mention that I have 15M reads each for the treated cells and the input, and 8M reads for the untreated cells.
Thanks!
The only real way to get to the bottom of this is to look at the regions around a few of the peaks from your untreated samples in IGV or something similar. My guess is that your untreated samples went through more PCR cycles, so you have "blocky" alignments caused by low library complexity. That would then show up as excessive peaks in the untreated sample.
Thank you, you are right. When I look in IGV, I can see many reads piled up in the same place rather than spread out evenly, and some of them are identical (only 40% of reads are not duplicates). I have a question, though: doesn't MACS2 remove duplicate reads before peak calling? If so, why are these overamplified regions, made up of the same reads, still called as peaks?
Even with duplicates removed, if you have large areas with little/no coverage then any area with even a bit of coverage will look like a peak.
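To make the point about duplicates concrete, here is a minimal Python sketch of what "keep at most one read per position" does to a blocky library. It is not MACS2's actual code; the function names and the toy reads are made up for illustration, and a read is reduced to a hypothetical `(chrom, start, strand)` tuple. The `keep=1` default mirrors what I understand MACS2's `--keep-dup` default to be.

```python
from collections import Counter


def collapse_duplicates(reads, keep=1):
    """Keep at most `keep` reads per identical (chrom, start, strand) tuple."""
    seen = Counter()
    kept = []
    for read in reads:
        if seen[read] < keep:
            kept.append(read)
        seen[read] += 1
    return kept


def non_duplicate_fraction(reads):
    """Fraction of reads left after collapsing to one read per position."""
    if not reads:
        return 0.0
    return len(set(reads)) / len(reads)


# Toy, made-up library: two overamplified stacks plus two isolated reads.
reads = (
    [("chr1", 1000, "+")] * 6
    + [("chr1", 1005, "-")] * 2
    + [("chr1", 5000, "+"), ("chr1", 9000, "-")]
)

kept = collapse_duplicates(reads)
fraction = non_duplicate_fraction(reads)
print(f"{len(kept)} of {len(reads)} reads kept ({fraction:.0%} non-duplicate)")
```

In this toy case 4 of 10 reads survive. Collapsing removes the stacks but does not fill in the empty regions, so the few positions that still carry reads stand out against a background that is close to zero. If the 40% figure applies to the 8M untreated reads, that would leave roughly 3.2M usable reads.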
I know that in MACS1, if your IP and input have different numbers of reads, the sample with more reads is downsampled to match the one with fewer.
If you only have 8M reads in the untreated cells, the input (15M) will be downsampled to 8M as well. This might affect your peak-calling results. And if you don't expect to see binding in the untreated sample, the low number of reads may be inherent. Just my 2c.
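The arithmetic behind matching the two libraries can be sketched as follows. This assumes a simple linear scaling of the larger library down to the smaller one; whether a given MACS version subsamples reads or rescales the signal, and in which direction by default, should be checked against its documentation. The function name and the pileup value of 30 are made up for illustration.

```python
def scale_to_smaller(chip_reads, control_reads):
    """Return (chip_factor, control_factor) that bring the larger library
    down to the depth of the smaller one by linear scaling."""
    if chip_reads <= 0 or control_reads <= 0:
        raise ValueError("read counts must be positive")
    target = min(chip_reads, control_reads)
    return target / chip_reads, target / control_reads


# Read counts from the question: 8M untreated ChIP reads, 15M input reads.
chip_factor, control_factor = scale_to_smaller(8_000_000, 15_000_000)

# Hypothetical example: an input pileup of 30 reads at some locus.
scaled_control_pileup = 30 * control_factor
print(f"ChIP factor: {chip_factor:.3f}, input factor: {control_factor:.3f}")
print(f"Input pileup of 30 counts as about {scaled_control_pileup:.1f}")
```

Under this assumption the input is weighted by 8/15, so nearly half of its signal is discounted, which lowers the background that the untreated ChIP is compared against.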