I'm using edgeR to test for DE genes in an expression matrix. For each sample if have a condition and a batch. I want to use the glm-functionality in edgeR to test for DE between conditions, while taking into account batches.
Some example data to show my problem. Say if have a count matrix, EM, of 10 sample with the following labels:
EM = EM # Imagine matrix of counts here...
conditions = c("con1", "con1", "con1", "con2", "con2", "con2", "con3", "con3", "con3", "con3")
batches = c("batch1", "batch2", "batch3", "batch1", "batch2", "batch3", "batch1", "con2", "con3", "con3")
The pipeline could look like this:
dge = DGEList(counts=EM) # Create object
dge = calcNormFactors(dge, method='TMM') # Normalize library sizes using TMM
design = model.matrix(~0+conditions+batches) # Create design matrix for glm
colnames(design) = c(levels(conditions), levels(batches)[2:length(levels(batches))]) # Set prettier column names
dge = estimateGLMCommonDisp(dge, design) # Estimate common dispersion
dge = estimateGLMTagwiseDisp(dge, design) # Estimate tagwise dispersion
fit = glmFit(dge,design) # Fit glm
pair_vector = sprintf("%s-%s", "con1", "con3") # Samples to be compared
pair_contrast = makeContrasts(contrasts=pair_vector, levels=design) # Make contrast
lrt = glmLRT(fit, contrast=pair_contrast) # Likelihood ratio test
My questions:
- The design matrix: There is no baseline conditions, so I remove the intersect with the 0+. Is this necessary to do for batches as well, even though they are not directly used as contrasts?
- Does the glmFit take into account norm-factors for library sizes?
Do you want to just compare "con3" and "con2" vs. "con1" or do all of the pairwise comparisons? Your setup is more targeted toward comparing things versus con1 and including an intercept rather than using contrasts.
I want to test all pairwise comparisons between conditions (no condition can be considered baseline'). I omitted the for-loop for clarity.