I have been analyzing RNA Polymerase II ChIP-seq datasets available at NCBI's Gene Expression Omnibus (GEO). I'm working with datasets that have been published by reputable labs.
I'm repeatedly finding that only the IP fraction is provided, and I assume the input fraction was not sequenced.
When I refer to the publications where the data are reported, I find y-axis labels such as "Counts Per Million", "Fold Enrichment", or "Spike-In Normalized". It appears that many labs have forgone input normalization completely and are solely using a spike-in control, generally in the form of chromatin from an independent species, for normalization. I understand that this type of control would allow normalization for library size or technical variation between samples. However, I do not see how a spike-in control could be used to normalize for site-based relative enrichment.
Am I missing something? Isn't an input sample a necessary control for accurate peak calling in ChIP-seq?
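To make the distinction concrete, here is a minimal sketch of how I understand the two normalizations. All counts are made up, and the function names (`spike_in_scale`, `input_fold_enrichment`) are hypothetical helpers, not taken from any published pipeline. The spike-in step multiplies every bin by one sample-wide factor, while the input step divides each bin by its own input signal.

```python
# Toy per-bin read counts (hypothetical numbers, for illustration only).
ip_counts = [10, 200, 30, 40]      # IP reads per genomic bin
input_counts = [10, 20, 30, 40]    # input reads in the same bins
spike_in_reads = 50_000            # IP-sample reads mapped to the spike-in genome (assumed)


def spike_in_scale(ip, spike_reads, per=1_000_000):
    """Scale every bin by a single sample-wide factor derived from spike-in reads."""
    factor = per / spike_reads
    return [count * factor for count in ip]


def input_fold_enrichment(ip, inp, pseudocount=1.0):
    """Per-bin IP/input ratio after scaling each library by its own total depth."""
    ip_total = sum(ip)
    inp_total = sum(inp)
    return [
        ((i + pseudocount) / ip_total) / ((n + pseudocount) / inp_total)
        for i, n in zip(ip, inp)
    ]


scaled = spike_in_scale(ip_counts, spike_in_reads)
enrichment = input_fold_enrichment(ip_counts, input_counts)

print("spike-in scaled:", scaled)
print("IP over input:  ", [round(x, 3) for x in enrichment])
```

In this toy case the first, third, and fourth bins still differ from each other after spike-in scaling, because the IP simply tracks the input there. Only the per-bin ratio against input flattens them and leaves the second bin standing out.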
Fascinating!
We also use inputs when doing metagene analyses, to protect against situations where regions upstream or downstream of our metagene alignment points (e.g. TSSs) are systematically more or less sequenceable/alignable.
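A minimal sketch of that idea, assuming the IP and input matrices (rows are genes, columns are positions around the alignment point) have already been depth-normalized. The numbers and the function names are hypothetical and only illustrate the principle.

```python
def metagene_profile(matrix):
    """Average signal at each position across genes (rows = genes, columns = positions)."""
    n_genes = len(matrix)
    return [sum(column) / n_genes for column in zip(*matrix)]


def input_normalized_metagene(ip_matrix, input_matrix, pseudocount=1.0):
    """Divide the IP metagene by the input metagene, position by position."""
    ip_profile = metagene_profile(ip_matrix)
    input_profile = metagene_profile(input_matrix)
    return [
        (ip + pseudocount) / (inp + pseudocount)
        for ip, inp in zip(ip_profile, input_profile)
    ]


# Toy data: two genes, four positions. The last position is poorly alignable,
# so both IP and input drop there; the second position carries real IP signal.
ip_matrix = [[4, 8, 4, 2], [6, 12, 6, 3]]
input_matrix = [[4, 4, 4, 2], [6, 6, 6, 3]]

profile = input_normalized_metagene(ip_matrix, input_matrix)
print([round(x, 3) for x in profile])
```

Here the dip at the last position shows up in the raw IP metagene but disappears in the ratio, while the enrichment at the second position remains.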