I am analyzing ChIP-seq data for several DNA-binding proteins and doing differential binding analysis between two conditions. I am using HOMER for the differential analysis, which basically runs the standard DESeq2 pipeline to compare counts between peak regions. MACS2 gives me ~20,000 peaks for each protein of interest, but when I do the differential analysis, none of these peaks show differential binding at a 5% FDR cutoff between my treatment conditions. This is very unexpected because the proteins are expected to lose binding in my treatment condition. It seems suspicious that I don't get even a single differential peak out of ~20,000 peaks in three different ChIP-seq datasets.
Looking at the p-values from my differential analysis, there are hundreds of peaks with significant unadjusted p-values, but all of them have very high adjusted p-values. Also, almost all peaks have an identical adjusted p-value of ~0.98, which is very strange. After reading previous threads about this, I made a histogram of my unadjusted p-values, which is not uniform but upward sloping (similar to scenario D in this article).
I am at a loss for what could be causing this and would appreciate any tips.
Thanks
If your p-value distribution does look like D in that article, there might be something wrong with the methodology itself or with the data going into the test. As the article mentions, if the null hypothesis were true for every peak you would get a uniform distribution, not a preponderance of p-values close to 1.
I would recommend trying a different program first, such as DiffBind on Bioconductor, and seeing if you run into the same problem.
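On the identical adjusted p-values: that part is probably just how Benjamini-Hochberg adjustment behaves when the raw p-values pile up near 1, rather than a separate bug. Here is a minimal sketch in plain Python, assuming BH is the adjustment being used (the DESeq2 default). The p-values are made up to mimic an upward-sloping histogram with 20,000 tests; they are not from any real dataset.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping a running minimum so the
    # adjusted values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values whose density rises toward 1 (illustrative assumption).
m = 20000
skewed = [((k + 0.5) / m) ** 0.5 for k in range(m)]
adj = bh_adjust(skewed)

n_raw_sig = sum(p < 0.05 for p in skewed)
n_distinct_adj = len({round(a, 6) for a in adj})
print(n_raw_sig, "raw p-values below 0.05")
print(n_distinct_adj, "distinct adjusted p-values")
```

Because of the running minimum, every peak inherits the smallest value of p * m / rank found at or above its own rank. When small p-values are rarer than a uniform distribution would give, that minimum sits near the top of the list, so nearly every peak ends up with the same adjusted value.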
Thanks, I will try DiffBind as well. Also, just to clarify, my histograms are similar to D in the linked article but not exactly like it. I am having some trouble posting pictures, but they are upward sloping with an abundance of high p-values and no spike at 1.
As suggested in the linked article, the likely reason is a violation of the assumptions of the statistical test that generates the p-values. For example, if the test assumes independent samples but they are not independent, you could end up with p-values distributed like in situation D. However, I think this can also happen with count data, where having many samples with low counts can give an excess of high p-values. In that case, you could try doing more aggressive QC before processing the data.
EDIT: This part of a DESeq2 tutorial may be of interest.
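To make the low-count point concrete, here is a rough sketch in plain Python. The function names, the threshold of 10 reads and the toy counts are all made up for illustration. The exact binomial test is only a stand-in to show how discrete counts behave; it is not the test DESeq2 runs.

```python
from math import comb


def filter_low_count_peaks(counts, min_total=10):
    """Keep peaks whose read count summed over all samples is >= min_total.

    `counts` maps a peak ID to a list of per-sample read counts.
    The default threshold is an arbitrary placeholder, not a recommendation.
    """
    return {peak: c for peak, c in counts.items() if sum(c) >= min_total}


def exact_binomial_pvalue(a, n):
    """Two-sided exact test that `a` of `n` reads fall in condition A (p = 0.5)."""
    probs = [comb(n, k) * 0.5 ** n for k in range(n + 1)]
    # Sum the probabilities of all outcomes at most as likely as the observed one.
    total = sum(p for p in probs if p <= probs[a] * (1 + 1e-9))
    return min(1.0, total)


def null_mass_above(n, threshold=0.9):
    """Probability, with no real difference, that the p-value exceeds `threshold`."""
    probs = [comb(n, k) * 0.5 ** n for k in range(n + 1)]
    return sum(
        probs[a] for a in range(n + 1) if exact_binomial_pvalue(a, n) > threshold
    )


# Made-up counts: peak_1 has almost no reads, peak_2 is well covered.
toy_counts = {"peak_1": [0, 1, 1, 0], "peak_2": [12, 15, 9, 11]}
kept = filter_low_count_peaks(toy_counts)
print(sorted(kept))

# Share of null p-values above 0.9 for regions with 2, 4 and 200 total reads.
for n in (2, 4, 200):
    print(n, round(null_mass_above(n), 3))
```

With only 2 reads in a region, half of the possible outcomes give a p-value of exactly 1, so a peak set dominated by such regions will tilt the histogram toward high values even when nothing else is wrong. Filtering them out before testing is one way to check whether that is what you are seeing.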
Thanks, this is exactly what my histograms look like. Do you think the fdrtool approach they use in the tutorial is worth trying? I ask because I am not sure how ChIP-seq count data differs from RNA-seq count data.
Yes, exploring with fdrtool is worth a shot.
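For intuition, the idea as I understand it is to stop assuming the test statistics follow a standard normal under the null, estimate their actual spread from the data, and recompute the p-values from that. Below is a simplified stand-in in plain Python. It uses a crude median-based scale estimate, which is not the fitting procedure fdrtool itself uses, and the z-scores are synthetic, so treat it as a sketch of the concept only.

```python
from statistics import NormalDist, median

STD_NORMAL = NormalDist()


def estimate_null_sd(z_scores):
    """Robust SD estimate for mean-zero z-scores: 1.4826 * median(|z|)."""
    return 1.4826 * median([abs(z) for z in z_scores])


def two_sided_pvalues(z_scores, sd=1.0):
    """Two-sided normal p-values after dividing each z-score by `sd`."""
    return [2.0 * (1.0 - STD_NORMAL.cdf(abs(z) / sd)) for z in z_scores]


# Synthetic z-scores that are too narrow (SD 0.5 instead of 1), which is the
# situation that produces an excess of high p-values.
n = 1000
z = [0.5 * STD_NORMAL.inv_cdf((i + 0.5) / n) for i in range(n)]

sd_hat = estimate_null_sd(z)
raw_p = two_sided_pvalues(z)                  # assumes SD = 1
rescaled_p = two_sided_pvalues(z, sd=sd_hat)  # uses the estimated SD

print("estimated null SD:", round(sd_hat, 3))
print("raw p < 0.05:", sum(p < 0.05 for p in raw_p))
print("rescaled p < 0.05:", sum(p < 0.05 for p in rescaled_p))
```

In a real analysis the input would be the per-peak test statistic from the differential analysis rather than simulated values. It is also worth re-plotting the histogram of the corrected p-values to confirm it looks roughly flat before trusting any adjusted values derived from it.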