I want to normalize data from an enrichment-based method (e.g., MeDIP, which captures methylated DNA sites).
Let's say I have two samples: a real sample (which targets modified DNA) and a dummy control (whose reads are randomly distributed along the genome). The number of reads in the real sample is N times greater than in the dummy control.
My question is: should I normalize the number of reads between the two samples?
Case A: On one hand, it seems logical to normalize read counts, since I will probably want to compare mean coverage between my samples. In that case, I can divide the coverage per CpG by the total number of reads.
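A minimal sketch of what I mean by case A, using toy numbers (the coverage values and library sizes below are made up for illustration): scale each sample's per-CpG coverage by its total read count, reads-per-million style, so the two samples are on a comparable scale.

```python
import numpy as np

# Hypothetical per-CpG raw coverage for the two samples (toy values).
real_cov = np.array([120.0, 8.0, 95.0, 4.0])  # real (enriched) sample
ctrl_cov = np.array([10.0, 1.0, 9.0, 1.0])    # dummy control

# Assumed total library sizes; the real sample is N = 10x deeper here.
real_total = 10_000_000
ctrl_total = 1_000_000

# Case A normalization: coverage per million total reads.
real_norm = real_cov / real_total * 1e6
ctrl_norm = ctrl_cov / ctrl_total * 1e6
```

After this scaling, mean coverage is directly comparable between the samples, regardless of the N-fold difference in sequencing depth.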
Case B: On the other hand, maybe the lower number of reads in the dummy control is the result of a biological process (e.g., the control sample contains no methylated DNA sites, thus no targets to be enriched, which is why we get far fewer reads for that sample).
I know that a common strategy is to normalize read counts. But what if the difference in read counts is itself a biological result? Can we know this? I am interested in how the community deals with this kind of problem.
First, I will add that the control read distribution isn't actually random; this is an important point for understanding the experiment.
There will be spikes in the control caused by artifacts, e.g. PCR amplification artifacts, dodgy alignments, and chromatin accessibility (especially if there is some sort of size selection).
If you load the BAMs into IGV you will see this. The control is used to look for enrichment over this background noise. Removing reads will shrink both your true and your bogus peaks, making them harder to tell apart.
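To make this concrete, here is a rough sketch (toy counts, hypothetical pseudocount choice, not any particular tool's method) of using the control as a background model: scale the control *up* to the real sample's library size and compute a log2 enrichment ratio, rather than throwing away reads from the deeper library.

```python
import numpy as np

# Toy per-bin read counts for the real (IP) sample and the dummy control.
ip_counts = np.array([50.0, 3.0, 40.0, 2.0, 60.0])
ctrl_counts = np.array([5.0, 1.0, 4.0, 1.0, 6.0])

# Scale the control to the IP library size instead of downsampling the IP,
# so true peaks keep their full signal over the background.
scale = ip_counts.sum() / ctrl_counts.sum()

# Pseudocount (assumed value) to avoid dividing by or taking log of zero.
pseudo = 1.0
enrichment = np.log2((ip_counts + pseudo) / (ctrl_counts * scale + pseudo))
```

Bins with enrichment well above zero stand out over the control background; artifact spikes present in both samples cancel out in the ratio.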