Hi all,
I have recently started analyzing ChIP-seq data. I have two datasets from GEO for two different transcription factors in the same sample, and I want to compare their overlapping binding sites and determine the genes that are co-regulated by these TFs. I have aligned both datasets using the same pipeline and used MACS2 for peak calling. I was going to use bedtools intersect to determine the overlapping peaks.
Before I proceed, however, I wanted to know whether I should normalize for library size. Typically, the data are normalized for library size by the statistical method used to detect differential abundance.
In this instance, are these peak numbers comparable to each other as is? If I need to normalize, how can I proceed with it?
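To make the overlap step concrete, here is a minimal Python sketch of what "overlapping peaks" means for BED-style intervals. It is only an illustration, not a replacement for bedtools: the peak coordinates and the function name `overlapping_peaks` are made up for the example, and the logic is meant to resemble reporting each peak of the first set that overlaps at least one peak of the second set (similar in spirit to `bedtools intersect -u`).

```python
from collections import defaultdict

def overlapping_peaks(peaks_a, peaks_b):
    """Return peaks from A that overlap at least one peak in B.

    Peaks are (chrom, start, end) tuples in BED convention
    (0-based start, end not included).
    """
    # Group the B peaks by chromosome so only same-chromosome peaks are compared.
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks_b:
        by_chrom[chrom].append((start, end))

    hits = []
    for chrom, start, end in peaks_a:
        # Two half-open intervals overlap if each starts before the other ends.
        if any(start < b_end and b_start < end
               for b_start, b_end in by_chrom.get(chrom, [])):
            hits.append((chrom, start, end))
    return hits

# Hypothetical peaks for the two TFs (not real data).
tf1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
tf2 = [("chr1", 150, 250), ("chr2", 300, 400)]

shared = overlapping_peaks(tf1, tf2)
print(shared)  # [('chr1', 100, 200)]
```

Note that whether two peaks overlap depends only on their coordinates, not on how many reads support them, so this step on its own does not involve library size.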
Since these are two different TFs, you presumably want to identify as many peaks as possible for each TF, and that number depends on the ChIP efficiency and the sequencing depth. So I would suggest doing the peak calling without library size normalization and then normalizing by sequencing depth for the downstream comparison. If it were the same TF under different conditions, we should normalize by sequencing depth, using the input or the sample with the lowest depth as the reference.
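As a rough illustration of what normalizing by sequencing depth can look like, here is a small Python sketch. The sample names, read counts, and function names are hypothetical, and this is just one simple way to do it (scaling every library to the smallest one, or converting per-peak read counts to counts per million), not the only option.

```python
def depth_scale_factors(read_counts):
    """Scale factor per library so that each is scaled down to the smallest one."""
    smallest = min(read_counts.values())
    return {name: smallest / n for name, n in read_counts.items()}

def counts_per_million(peak_reads, library_size):
    """Convert raw per-peak read counts to counts per million mapped reads."""
    return [reads * 1_000_000 / library_size for reads in peak_reads]

# Hypothetical mapped-read totals for the two ChIP libraries.
depths = {"TF_A": 20_000_000, "TF_B": 40_000_000}

factors = depth_scale_factors(depths)
print(factors)  # {'TF_A': 1.0, 'TF_B': 0.5}

# Hypothetical raw read counts in two peaks of the TF_A library.
cpm = counts_per_million([200, 1000], depths["TF_A"])
print(cpm)  # [10.0, 50.0]
```

The scale factors would be applied to signal values (for example coverage tracks or per-peak read counts) before comparing them across samples. The peak coordinates themselves stay unchanged.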
Thank you so much for your reply, Prakash. Yes, I want to identify the maximum number of peaks for each transcription factor. So far, I have performed peak calling without any normalization between the two TFs, although each IP is normalized with respect to its corresponding input. I now have the MACS2 output peak files. As I move on to the downstream comparison, I have some follow-up questions. Also, although the samples are biologically the same, the ChIP experiments were performed years apart and in different labs.
I would really like to understand the normalization aspect for two different ChIP-seq datasets, though this might be a naive question!