To normalize ChIP-seq data with a separate-species spike-in, I'm following the advice described in this GitHub issue:
- Align to spiked-in species.
- Compute scaling factor (e.g., multiBamSummary ... --scalingFactors). Note, you should have a look at some of the samples to ensure that the spiked-in species has vaguely uniform coverage. If the coverage is very spotty then you may need to use only the covered regions to compute scaling factors.
- Use the scaling factors from step 2 with bamCoverage with your on-species alignments.
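To make the last step concrete, here is a minimal Python sketch of how the scaling-factor file might be turned into bamCoverage calls. It assumes the file has the tab-separated "sample" and "scalingFactor" columns that the multiBamSummary documentation describes, and it passes each value to bamCoverage through --scaleFactor. The sample labels, BAM file names, and output names are hypothetical, as is the mapping from sample label to on-species BAM, which you would have to supply yourself.

```python
import csv
import io


def read_scaling_factors(handle):
    """Parse a TSV with a 'sample' and a 'scalingFactor' column into a dict."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["sample"]: float(row["scalingFactor"]) for row in reader}


def bamcoverage_command(on_species_bam, out_bigwig, scale_factor):
    """Build (but do not run) a bamCoverage command line as a list of arguments."""
    return [
        "bamCoverage",
        "-b", on_species_bam,
        "-o", out_bigwig,
        "--scaleFactor", str(scale_factor),
    ]


# Hypothetical file contents; in practice, open the file written by --scalingFactors.
example_tsv = "sample\tscalingFactor\nIP_rep1\t0.8500\nIP_rep2\t1.2000\n"
factors = read_scaling_factors(io.StringIO(example_tsv))

# Hypothetical mapping from spike-in sample label to the on-species (S. cerevisiae) BAM.
on_species_bams = {"IP_rep1": "IP_rep1.sc.bam", "IP_rep2": "IP_rep2.sc.bam"}

commands = [
    bamcoverage_command(on_species_bams[sample], f"{sample}.bw", factor)
    for sample, factor in factors.items()
]
for command in commands:
    print(" ".join(command))
```

The commands are only printed here; running them (for example with subprocess.run) requires deepTools to be installed and the BAM files to be indexed.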
From the multiBamSummary bins documentation:
--scalingFactors FILE
Compute scaling factors (in the DESeq2 manner) compatible for use with bamCoverage and write them to a file. The file has tab-separated columns "sample" and "scalingFactor". (default: None)
My understanding is that the DESeq2-style scaling factor calculation works with counts of alignments in bins, rather than counts of alignments to genes or other features as in standard RNA-seq analyses.
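For reference, this is a small Python sketch of the median-of-ratios idea behind DESeq2 size factors, applied to a bins-by-samples count table. It is not deepTools' code; the function name and the toy counts are made up. The final reciprocal step reflects my assumption that a value meant to multiply coverage in bamCoverage should be the inverse of the size factor.

```python
import math
import statistics


def median_of_ratios_size_factors(counts):
    """Median-of-ratios size factors for a table of bins (rows) x samples (columns).

    Bins with a zero count in any sample are skipped, because their
    geometric mean across samples is zero and the log ratio is undefined.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if any(count <= 0 for count in row):
            continue
        log_geo_mean = sum(math.log(count) for count in row) / n_samples
        for j, count in enumerate(row):
            log_ratios[j].append(math.log(count) - log_geo_mean)
    # Raises statistics.StatisticsError if every bin was skipped.
    return [math.exp(statistics.median(ratios)) for ratios in log_ratios]


# Toy spike-in counts: sample 2 has exactly twice the reads of sample 1 in every usable bin.
bin_counts = [[10, 20], [30, 60], [0, 5], [8, 16]]
size_factors = median_of_ratios_size_factors(bin_counts)

# Assumption: multiply coverage by the reciprocal to bring samples to a common scale.
scaling_factors = [1.0 / sf for sf in size_factors]
print(size_factors, scaling_factors)
```

Note how the bin size enters only through which bins end up with zero counts and how noisy the per-bin ratios are; the median itself is taken over whatever bins survive.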
With this in mind, the default bin size for multiBamSummary bins is 10,000 bp:
--binSize INT, -bs INT
Length in bases of the window used to sample the genome. (Default: 10000)
However, I'm working with the S. cerevisiae model and using S. pombe as the spiked-in species. Both fungal genomes (~12 Mb) are significantly smaller and more feature-dense compared to the human (3.2 Gb) and mouse (2.5 Gb) genomes.
So, my question is this: Should I decrease --binSize from its default value when calculating scaling factors from S. pombe alignments?
If so, what would be an appropriate value? To illustrate, for a 3.2 Gb genome, a 10 kb bin size yields 320,000 bins; for a 12 Mb genome, solving 12 Mb ÷ x = 320,000 gives x = 37.5 bp, which we could round up to 40-bp bins. Alternatively, we might increase it to 150 bp, approximating the typical rounded-up size of a nucleosome.
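The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it uses the rounded genome sizes from this post and ignores chromosome boundaries, so real bin counts will differ slightly. The function names are mine.

```python
def number_of_bins(genome_bp, bin_bp):
    """Number of bins needed to tile a genome (ceiling division)."""
    return -(-genome_bp // bin_bp)


def matched_bin_size(genome_bp, target_bins):
    """Bin size that would split a genome into the target number of bins."""
    return genome_bp / target_bins


HUMAN_BP = 3_200_000_000  # rounded size used in this post
YEAST_BP = 12_000_000     # rounded size used in this post

human_bins = number_of_bins(HUMAN_BP, 10_000)
print("human bins at 10 kb:", human_bins)
print("yeast bin size for the same bin count:", matched_bin_size(YEAST_BP, human_bins))

for bin_bp in (40, 150, 10_000):
    print(f"yeast bins at {bin_bp} bp:", number_of_bins(YEAST_BP, bin_bp))
```

With the default 10 kb bins, a 12 Mb genome is tiled by only about 1,200 bins, compared with 80,000 bins at 150 bp and 300,000 bins at 40 bp.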
An additional question: Should I further adjust the --binSize value based on the factor being ChIPped? For instance, should I opt for a larger bin size for a near-ubiquitous factor like histone H3, and a smaller bin size for ChIPs of, e.g., RNA Pol I, II, or III? Perhaps even smaller for a transcription factor?
Thank you.
I think your best answer could be derived from visualizing the coverage as suggested in step 2.
Is it relatively uniform? If so, you may be able to identify what bin size captures the variability in coverage. Does a 10 kb bin tend to capture a lot of variability (e.g., regions with 0-10x coverage alongside regions with 100+x coverage)? Then I would shorten the binSize.
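One rough way to put a number on "how much variability does a bin capture" is to look at the spread of per-base coverage inside each bin for a few candidate bin sizes. The Python sketch below does this on a made-up coverage vector; in practice the per-base values would come from your spike-in alignments, and the function name is hypothetical.

```python
def per_bin_range(coverage, bin_size):
    """Return (min, max) per-base coverage for consecutive, non-overlapping bins."""
    chunks = (coverage[i:i + bin_size] for i in range(0, len(coverage), bin_size))
    return [(min(chunk), max(chunk)) for chunk in chunks]


# Toy per-base coverage: an uncovered stretch, a high-coverage stretch, then a flat region.
toy_coverage = [0] * 5 + [100] * 5 + [10] * 10

wide_bins = per_bin_range(toy_coverage, 10)
narrow_bins = per_bin_range(toy_coverage, 5)
print("bin size 10:", wide_bins)
print("bin size 5:", narrow_bins)
```

In this toy case the wider bin lumps the 0x and 100x stretches together, while the narrower bins separate them, which is the kind of signal that would argue for a smaller binSize.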
You could also consider calling peaks and supplying them to the multiBamSummary BED-file version to count over "features". This may be more pertinent for TF ChIPs.
As a note, I think having too small a binSize is less of an issue than having too large a binSize.