Hi,
I have two conceptual/technical questions about spike-in normalisation of ChIP-seq data.
We did spike-in for PolII ChIP-seq from a mouse cell line with ActiveMotif kit, which uses anti-H2Av antibody with drosophila cells. The mouse and fly chromatin are mixed and a mix of anti-mouse PolII and anti-H2Av antibodies are used to IP. I aligned the reads to mouse and drosophila genome separately. And now I am trying to get normalization factors for each replicate using DiffBind (v. 3.8.3), which I would like to use for differential peak analysis as well as scaling bigwig files.
Qn1: From what I can gather, in DiffBind, by including the spikein bam filles in the Spikein column of sampleSheet, I have the option to use the spikein library size or the csaw background bins method with RLE or TMM. Is there a built-in option to use the reads in peaks (RiP) method from the spike-in datasets (i.e., read counts in H2Av peaks) for normalizing the mouse reads? If there is an option to supply the peaks in the spike-in dataset, I cannot find it.
To my naive thinking, this option better controls for IP efficiency (~signal/noise ratio) per replicate than library size or background bins (assuming the antibody and chromatin concentrations in the replicates are constant). Am I missing something conceptually?
Qn2: To get the normalisation factors using the RiP from the spike-in datasets, I made a dba object using the drosophila sequences ONLY (drosophila bam files + H2Av peaks) and tried the various normalisations (code below). Differential peaks should not be found between conditions here, since these are spike-in chromatin IPs. It looks like the RiP RLE normalisation does an eerily excellent job at normalising the counts, while the library size or background bins methods still give a lot significantly different peaks when they shouldn't (although the fold change is minimal).
I understand, with RiP RLE normalisation, I am more directly normalising the peak signals and plotting the peak signal, so the fact that it has the best effect in this analysis is not surprising in itself. But I am surprised by the near absence of variation between the replicates (meaning a lot of the variation comes from IP efficiency between samples?). Does this mean, RiP RLE normalisation using spike-in peaks is the correct method to normalise my mouse read counts?
I am pretty new to ChIP-seq/DiffBind analyses. Please let me know if anything is wrong or unclear. Thank you!
>spik.db <- dba(sampleSheet = "../input_data/sampleSheets/P2_dm3_spikein_sampleSheet_DiffBind.csv",
minOverlap = 1)
>spik.db
4 Samples, 11449 sites in matrix:
ID Tissue Factor Condition Treatment Replicate Intervals
1 1 CL P2 WT Full-media 1 10715
2 2 CL P2 WT Full-media 2 9763
3 3 CL P2 KO Full-media 1 11166
4 4 CL P2 KO Full-media 2 11480
>spik.counts <- dba.count(spik.db,minOverlap = 1)
>spik.model <- dba.contrast(spik.counts, minMembers=2, reorderMeta=list(Condition="KO"))
>spik.model <- dba.analyze(spik.model) #default lib size
>dba.plotMA(spik.model, bNormalized = F, sub="No norm")
>dba.plotMA(spik.model, bNormalized = T, sub="Norm: Lib size")
>rm(spik.model)
>spik.model <- dba.contrast(spik.counts, minMembers=2, reorderMeta=list(Condition="KO"))
>spik.model<-dba.normalize(spik.model,normalize = "RLE",background=T) # bkgd bins RLE normalisation
>spik.model <- dba.analyze(spik.model)
>dba.plotMA(spik.model, bNormalized = T, sub="Norm: bkgd bins RLE")
>rm(spik.model)
>spik.model <- dba.contrast(spik.counts, minMembers=2, reorderMeta=list(Condition="KO"))
>spik.model<-dba.normalize(spik.model,normalize = "RLE") # RiP RLE normalisation
>spik.model <- dba.analyze(spik.model)
>dba.plotMA(spik.model, bNormalized = T, sub="Norm: RiP RLE")
>db_report<-dba.report(spik.model,th=1,bCounts = T) #to examine normalized counts of RiP RLE
>plot(db_report$Fold~db_report$Conc,main="RLE (RiP) y-axis zoomed")
cross-posted to https://support.bioconductor.org/p/9148431/