Question

Calculation of ChIP-seq normalization factors with non-conventional spike-in assumptions

13 months ago

jared.andrews07 ★ 18k

I have an experimental setup where there are known global shifts in the levels of our histone mark of interest due to a histone mutation. We include spike-in chromatin in each sample, but we know that the spike-in levels are not truly identical across samples, given the technical difficulties of quantitation and spike-in addition. However, we have inputs for each sample that we can assume contain the same ratio of spike-in chromatin as the matching ChIP, given when the spike-in chromatin is added. This means that we can calculate the percentage of spike-in reads for each sample as spikein_input_read% and spikein_chip_read% and derive a ratio between them.

This ratio does not inherently account for the inevitable signal-to-noise differences between samples with and without the mutation.
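A minimal sketch of the per-sample quantities described above. The sample names and read counts are made up for illustration, and the direction of the ratio (ChIP over input) is an assumption, since the post does not fix it.

```python
# Hypothetical read counts per sample: (spike-in reads, total mapped reads).
# These numbers are invented for illustration only.
samples = {
    "WT_rep1":   {"input": (20_000, 10_000_000), "chip": (50_000, 10_000_000)},
    "K27M_rep1": {"input": (30_000, 12_000_000), "chip": (400_000, 8_000_000)},
}


def spikein_percent(spikein_reads, total_reads):
    """Percentage of mapped reads that come from the spike-in genome."""
    return 100.0 * spikein_reads / total_reads


def spikein_ratio(sample):
    """Ratio of spikein_chip_read% to spikein_input_read% (direction assumed)."""
    input_pct = spikein_percent(*sample["input"])
    chip_pct = spikein_percent(*sample["chip"])
    return chip_pct / input_pct


ratios = {name: spikein_ratio(s) for name, s in samples.items()}
for name, ratio in ratios.items():
    print(f"{name}: chip% / input% = {ratio:.2f}")
```

With this direction, a sample that has lost most of the mark gives a larger ratio, because spike-in reads take up a larger share of its ChIP library.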

So my question is ultimately: given sample-wise values of spikein_input_read% and spikein_chip_read%, what are the options for accounting for both library size and global composition differences during normalization?

I have read the relevant sections of both the DiffBind and csaw documentation thoroughly, but both assume identical spike-in levels across all samples. Are my thoughts above folly or is there a way to normalize this dataset in a way that makes sense?

This question has also been cross-posted to the Bioconductor support site, and relevant answers provided there will be linked/summarized here.

DiffBind csaw normalization spikein ChIPseq • 1.0k views

Tagging: Rory Stark

ADD REPLY • link 13 months ago by GenoMax 148k

Also tagging ATpoint since I expect he may have an idea or two.

ADD REPLY • link 13 months ago by jared.andrews07 ★ 18k

How about a two-step approach? In step 1, apply your strategy and then inspect the MA plots of the pairwise comparisons. If the experiment is not totally abnormal, these should have more or less the typical arrowhead-like shape. Check whether the rightmost part of the plot is aligned with y=0. If not, take the points/regions on the right of the plot with large baseMean (large baseMean in all conditions, and therefore hopefully somewhat representative of non-differential regions) and recalculate the size factors using only them. With global shifts you need to decide which regions to use for normalization if you want to go beyond the spike-ins, and the MA-plot strategy usually works for me. It is at least data-driven rather than arbitrary, unlike selecting "housekeeping regions", e.g. peaks near Gapdh or beta-actin, which is probably nonsense.
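A rough sketch of the second step, assuming a plain region-by-sample count table and a median-of-ratios calculation restricted to the most abundant regions. The function name, the `top_n` cutoff, and the toy counts are all hypothetical; in practice the cutoff would be chosen by looking at the MA plot.

```python
import math
from statistics import median


def size_factors_from_high_abundance(counts, top_n):
    """Median-of-ratios size factors using only the top_n most abundant regions.

    counts: list of rows, one per region, each a list of per-sample counts.
    """
    # Drop regions with a zero in any sample so that logs are defined.
    usable = [row for row in counts if all(c > 0 for c in row)]

    def mean_log(row):
        # Log of the geometric mean across samples (the abundance axis).
        return sum(math.log(c) for c in row) / len(row)

    # Keep the regions with the highest average abundance.
    top = sorted(usable, key=mean_log, reverse=True)[:top_n]

    n_samples = len(top[0])
    factors = []
    for j in range(n_samples):
        # Log ratio of sample j to the per-region geometric mean.
        log_ratios = [math.log(row[j]) - mean_log(row) for row in top]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy table: three high-count regions share one depth pattern, while the
# low-count regions (and one region with a zero) follow a different one.
counts = [
    [1000, 2000, 500], [2000, 4000, 1000], [1500, 3000, 750],
    [10, 5, 20], [12, 4, 30], [8, 6, 25], [0, 3, 8],
]
factors = size_factors_from_high_abundance(counts, top_n=3)
print([round(f, 3) for f in factors])
```

The factors returned here follow the divide-counts-by-factor convention. Whether the selected regions really are non-differential is an assumption that the re-drawn MA plot should be used to check.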

ADD REPLY • link 13 months ago by ATpoint 86k

An interesting idea. A bit more context and some figures. So here is an MA plot for an H3K27me3 experiment in which the mark is largely ablated in samples with an H3.3K27M mutation, with focal retention at sites with particularly high signal in the H3 WT samples:

[Image: MA plot for the H3K27me3 experiment, H3.3K27M vs. H3 WT samples]

Note that we know this isn't accurate, as western blots, IF, IHC, you name it, all show that the mark is dramatically decreased, and our spike-ins show it quite clearly as well. Based on your previous very helpful answer, we use the reciprocal of the spike-in ratio (spikein_input_read% / spikein_chip_read%) to derive a scaling factor for track generation via deepTools, which generally works well enough.

We can use these directly as our normalization factors in DiffBind, and end up with an MA plot that looks like this:

[Image: MA plot after supplying the spike-in-derived factors to DiffBind]

This looks more like what we would expect, but it doesn't account for library size, as DiffBind appears to use the provided factors directly. (I think deepTools accounts for library size if --normalizeUsing is set to an appropriate option, but I should probably double-check.) I could multiply the library-derived size factors by these values, which would then take read depth into account. As you mention, there is a set of high-count peaks that we'd consider "non-differential" between the two conditions, and I rather like the idea of incorporating them. I will give that a shot.
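One way the depth and spike-in terms might be combined, sketched under two assumptions: the result is used as a divide-counts-by-factor size factor, and depth is taken as the total ChIP library size. The sample values are invented. Algebraically, library size times spikein_chip_read% / spikein_input_read% is proportional to the ChIP spike-in read count divided by the input spike-in fraction.

```python
import math

# Hypothetical per-sample values, for illustration only.
samples = {
    "WT_rep1":   {"chip_total": 10_000_000, "chip_pct": 0.5, "input_pct": 0.20},
    "K27M_rep1": {"chip_total": 8_000_000,  "chip_pct": 5.0, "input_pct": 0.25},
}


def combined_size_factors(samples):
    """Depth-aware spike-in size factors, scaled to a geometric mean of 1.

    Assumes counts will be DIVIDED by the returned factors.
    """
    # Library depth multiplied by the ChIP-over-input spike-in ratio.
    raw = {
        name: s["chip_total"] * s["chip_pct"] / s["input_pct"]
        for name, s in samples.items()
    }
    # Center on the geometric mean so the factors stay near 1.
    log_mean = sum(math.log(v) for v in raw.values()) / len(raw)
    geo_mean = math.exp(log_mean)
    return {name: v / geo_mean for name, v in raw.items()}


size_factors = combined_size_factors(samples)
for name, sf in size_factors.items():
    print(f"{name}: size factor = {sf:.3f}")
```

For a tool that multiplies by its factor instead, the reciprocal of each value would be the analogous quantity.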

I really end up confusing myself when dealing with these factors, since I struggle to remember whether high or low values increase or decrease the counts/values depending on which tool is being used.

I ask these questions with this very strong and obvious example because we have another dataset where we suspect there might be a global(ish) shift going on, but in a much more subtle manner. Signal-to-noise variation is making it trickier to confirm, and the effect isn't large enough to reliably show up in westerns.
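A tiny illustration of the two conventions that tend to get mixed up, using made-up numbers. DESeq2-style size factors divide counts, whereas the deepTools `--scaleFactor` option multiplies coverage, so the same adjustment is expressed by reciprocal values. Which convention any other tool follows is worth checking in its documentation.

```python
raw_count = 100.0

# Divisor convention (DESeq2-style size factor): a LARGER factor shrinks the value.
size_factor = 2.0
normalized_by_size_factor = raw_count / size_factor

# Multiplier convention (deepTools --scaleFactor): a SMALLER factor shrinks the value.
scale_factor = 1.0 / size_factor
normalized_by_scale_factor = raw_count * scale_factor

print(normalized_by_size_factor, normalized_by_scale_factor)
```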

ADD REPLY • link 13 months ago by jared.andrews07 ★ 18k