In a previous question, I asked about how to do "background subtraction" for ChIP-Seq data that has matched input samples. Most of the answers there asserted that my idea of pre-processing my count data by subtracting an appropriately-scaled version of the control (i.e. ChIP-seq input) data was the wrong approach. Rather, I should provide the raw IP counts along with the input counts to edgeR and use its generalized linear modelling features to separate signal from background in a statistically rigorous way. I tend to agree with this idea, but I don't really know how to do it, because once I try to factor in the input-vs-IP contrast along with the rest of my experimental design, things get so complicated that I don't know how to construct the model matrix that describes my experimental design.
So, here is my sample table (with specific identifiers replaced by "Donor1", "Donor2", etc.):
sampledata <- data.frame(donor=rep(rep(c("Donor1", "Donor2", "Donor3", "Donor4"), each=2), 2),
celltype=rep(c("memory", "naive"), 8),
sampletype=rep(c("IP", "input"), each=8))
rownames(sampledata) <- do.call(sprintf,c("%s-%s-%s", as.list(sampledata)))
The table looks like this:
> print(sampledata)
donor celltype sampletype
Donor1-memory-IP Donor1 memory IP
Donor1-naive-IP Donor1 naive IP
Donor2-memory-IP Donor2 memory IP
Donor2-naive-IP Donor2 naive IP
Donor3-memory-IP Donor3 memory IP
Donor3-naive-IP Donor3 naive IP
Donor4-memory-IP Donor4 memory IP
Donor4-naive-IP Donor4 naive IP
Donor1-memory-input Donor1 memory input
Donor1-naive-input Donor1 naive input
Donor2-memory-input Donor2 memory input
Donor2-naive-input Donor2 naive input
Donor3-memory-input Donor3 memory input
Donor3-naive-input Donor3 naive input
Donor4-memory-input Donor4 memory input
Donor4-naive-input Donor4 naive input
There are 4 individuals (donors). Two cell types were isolated from each donor, and each sample was ChIPed, and both the IP and the input were sequenced. The result is 4 donor-matched pairs of memory/naive IP samples (rows 1 through 8), plus matched input samples for everything (rows 9-16). I want to find features with significant differences in the non-background between naive and memory in the IP samples, after controlling for inter-donor variation in the IP samples as well as any variation at all among the input samples.
Answers from the other question recommended that I should model the input samples as "Background" and the IP samples as "Background + signal", then allow edgeR to separate the signal from the background and do differential tests between the signals. This all makes perfect sense. However, as I said, I don't know how to construct the model matrix that describes such a setup.
So my question is this. What is the code for generating the appropriate model matrix for my experimental design? Typically the code would appear something like the following:
design <- model.matrix(???, data=sampledata)
where ???
is the formula that I don't know how to write. Alternatively, you could provide code that directly constructs the model matrix.
I realize that I'm just asking for the answer here to my specific question, but I'm hoping that the answer will be sufficiently illustrative that anyone who wants to use edgeR for ChIP-Seq will benefit from it. If I can get this working, I would like to submit my code to the edgeR developers for them to include as a case study to demonstrate use of edgeR on ChiP-Seq data.
More generally, is there a decent guide to writing complex formulas for the GLM functionality in edgeR and other packages? It seems like the documentation for edgeR and others assumes that you already know how to write such formulas, and simply explains how to plug them into the package using trivial examples with no more than two factors.
Ok, so I tried this approach and I found that I got more significant p-values and FDR values when I left the input samples out of the analysis entirely and just compared IP vs IP directly. In fact, when using the input samples in my analysis, no promoters were significant at a 0.05 FDR. I wonder if my design was wrong somehow, or if there's something inherently wrong with the "difference-of-differences" approach. It doesn't seem plausible that all the significant promoters in the no-input analysis are the result of changes in the input rather than changes in the IP. I'm not really sure what to make of this.
That's interesting. Have you looked at the results in a genome browser? Specifically, if you create RPM tracks of the IPs and inputs, and just compare them by eye, do the results make sense? You might also try some sanity checks via Marson et al. & Young, Cell 2008 (pmid: 18692474, see the supplement) where you evaluate the enrichment level of your promoters via 25 bp bins, and then compare the enrichments between cell types to see if it's all just scatter, how it might compare to other chosen locations, or if the differences you're seeing make sense. It's a lot of coding in R, but you could get a good picture of the situation.
Note that you can skip the "create factors" section of your code and instead to the following for creating the matrix:
I wonder if it is sufficient to just add
+donor
to the formula that you give in order to get paired tests?9 years later...has this here ever come to a working solution? I wonder how, despite all the modeling approaches, you address the compositional differences between IP and input and therefore a decent (e.g. TMM) normalization.