Question

Modeling Chip-Seq Background Using Edger'S Glm Functionality?

1

Entering edit mode

12.3 years ago

Ryan Thompson ★ 3.6k

In a previous question, I asked about how to do "background subtraction" for ChIP-Seq data that has matched input samples. Most of the answers there asserted that my idea of pre-processing my count data by subtracting an appropriately-scaled version of the control (i.e. ChIP-seq input) data was the wrong approach. Rather, I should provide the raw IP counts along with the input counts to edgeR and use its generalized linear modelling features to separate signal from background in a statistically rigorous way. I tend to agree with this idea, but I don't really know how to do it, because once I try to factor in the input-vs-IP contrast along with the rest of my experimental design, things get so complicated that I don't know how to construct the model matrix that describes my experimental design.

So, here is my sample table (with specific identifiers replaced by "Donor1", "Donor2", etc.):

sampledata <- data.frame(donor=rep(rep(c("Donor1", "Donor2", "Donor3", "Donor4"), each=2), 2), 
                         celltype=rep(c("memory", "naive"), 8), 
                         sampletype=rep(c("IP", "input"), each=8))
rownames(sampledata) <- do.call(sprintf,c("%s-%s-%s", as.list(sampledata)))

The table looks like this:

> print(sampledata)
                     donor celltype sampletype
Donor1-memory-IP    Donor1   memory         IP
Donor1-naive-IP     Donor1    naive         IP
Donor2-memory-IP    Donor2   memory         IP
Donor2-naive-IP     Donor2    naive         IP
Donor3-memory-IP    Donor3   memory         IP
Donor3-naive-IP     Donor3    naive         IP
Donor4-memory-IP    Donor4   memory         IP
Donor4-naive-IP     Donor4    naive         IP
Donor1-memory-input Donor1   memory      input
Donor1-naive-input  Donor1    naive      input
Donor2-memory-input Donor2   memory      input
Donor2-naive-input  Donor2    naive      input
Donor3-memory-input Donor3   memory      input
Donor3-naive-input  Donor3    naive      input
Donor4-memory-input Donor4   memory      input
Donor4-naive-input  Donor4    naive      input

There are 4 individuals (donors). Two cell types were isolated from each donor, and each sample was ChIPed, and both the IP and the input were sequenced. The result is 4 donor-matched pairs of memory/naive IP samples (rows 1 through 8), plus matched input samples for everything (rows 9-16). I want to find features with significant differences in the non-background between naive and memory in the IP samples, after controlling for inter-donor variation in the IP samples as well as any variation at all among the input samples.

Answers from the other question recommended that I should model the input samples as "Background" and the IP samples as "Background + signal", then allow edgeR to separate the signal from the background and do differential tests between the signals. This all makes perfect sense. However, as I said, I don't know how to construct the model matrix that describes such a setup.

So my question is this. What is the code for generating the appropriate model matrix for my experimental design? Typically the code would appear something like the following:

design <- model.matrix(???, data=sampledata)

where ??? is the formula that I don't know how to write. Alternatively, you could provide code that directly constructs the model matrix.

I realize that I'm just asking for the answer here to my specific question, but I'm hoping that the answer will be sufficiently illustrative that anyone who wants to use edgeR for ChIP-Seq will benefit from it. If I can get this working, I would like to submit my code to the edgeR developers for them to include as a case study to demonstrate use of edgeR on ChiP-Seq data.

More generally, is there a decent guide to writing complex formulas for the GLM functionality in edgeR and other packages? It seems like the documentation for edgeR and others assumes that you already know how to write such formulas, and simply explains how to plug them into the package using trivial examples with no more than two factors.

chip-seq edger • 5.3k views

ADD COMMENT • link updated 3.6 years ago by ATpoint 86k • written 12.3 years ago by Ryan Thompson ★ 3.6k

score 3 · Answer 1 · 2012-08-24

"It seems like the documentation for edgeR and others assumes that you already know how to write such formulas..."

You can say that again. I find it confusing as well, but am usually able to figure it out. Gordon Smyth is usually very good at answering questions about design matrices on the bioconductor mailing list, and if you search around you'll find several examples to puzzle through. But I've never read a general guide written for non-statisticians.

I think this is what you're after. I re-arranged your matrix because it's easier for me to think about the resulting design matrix.

sampledata <- sampledata[order(sampledata$celltype),]
print(sampledata)
                     donor celltype sampletype
Donor1-memory-IP    Donor1   memory         IP
Donor2-memory-IP    Donor2   memory         IP
Donor3-memory-IP    Donor3   memory         IP
Donor4-memory-IP    Donor4   memory         IP
Donor1-memory-input Donor1   memory      input
Donor2-memory-input Donor2   memory      input
Donor3-memory-input Donor3   memory      input
Donor4-memory-input Donor4   memory      input
Donor1-naive-IP     Donor1    naive         IP
Donor2-naive-IP     Donor2    naive         IP
Donor3-naive-IP     Donor3    naive         IP
Donor4-naive-IP     Donor4    naive         IP
Donor1-naive-input  Donor1    naive      input
Donor2-naive-input  Donor2    naive      input
Donor3-naive-input  Donor3    naive      input
Donor4-naive-input  Donor4    naive      input

# create factors
celltype <- factor(sampledata$celltype)
sampletype <- factor(sampledata$sampletype)

# create the design matrix
design <- model.matrix(~celltype+celltype:sampletype)

If you examine the matrix you can see what's going on: "(Intercept)" "celltypenaive" "celltypememory:sampletypeIP" "celltypenaive:sampletypeIP". Your main interest is the interaction between celltype and sample type. The first coefficient is the ratio between cell types (not interesting), but the second and third coefficients are the IP/input ratios for memory and naive cells respectively. So get the values for those coefficients.

# assuming y is your DGEList object
# all norm factored and dispersioned up
fit <- glmFit(y,design)

# get the main coefficients of interest
memory <- glmLRT(y,fit,coef="celltypememory:sampletypeIP")
naive <- glmLRT(y,fit,coef="celltypenaive:sampletypeIP")

# take the contrast between the two to find the differences
mem_nai_contrast <- glmLRT(y,fit,contrast=c(0,0,1,-1))

This is probably the simplest approach because it treats your donors as simply 4 biological replicates, and ignores that fact that they're each matched between IP and input. But it might get you on the map in terms of at least looking at the data.

score 1 · Answer 2 · 2012-08-23

1

Entering edit mode

12.3 years ago

Neilfws 49k

To answer your last, more general query: the limma user guide (PDF) is an excellent resource. Obviously its main focus is use of the limma package for microarray analysis, but it includes very good discussions of linear models in general (section 7) and model matrix design (section 8), with case studies (section 11).