My experimental design has two experimental factors, four conditions (+/+, -/+, +/-, -/-) and four biological replicates per condition. The batch effect in this experiment comes from the fact that the first replicate of every condition was collected at one time, the second replicates at another, and so on, so replicate number doubles as batch.
I have now written a somewhat bloated R script that tries to accomplish two separate goals:
(1) find genes that are differentially expressed between presence and absence of each experimental factor separately, i.e. genes that respond to factor 1 alone or to factor 2 alone.
(2) get TMM-normalized FPKM values where the batch effect has been removed.
To accomplish (1), I loaded my matrix of raw counts, rounded it, encoded the experimental factors and the batch as R factors, filtered by total counts, created a DGEList object, carried out TMM normalization, and built a design matrix with the experimental factors plus the batch effect (~batch+factor1+factor2). I then estimated dispersions with estimateGLMCommonDisp(), estimateGLMTrendedDisp() and estimateGLMTagwiseDisp(), fitted with glmFit(), and ran glmLRT() twice, once for each experimental factor, choosing the coefficient from the design matrix (e.g. coef=5 when the factor I want to test appears in the sixth printed column, because the first printed column is just the row name).
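The setup steps described above look roughly like this (a sketch, not my actual script: the file name, the filter threshold and the sample ordering are placeholders, and the factor layout assumes samples are ordered condition-major with replicates within each condition):

```r
library(edgeR)

## Hypothetical input file; counts from RSEM are non-integer, hence the rounding
counts <- round(as.matrix(read.delim("counts.txt", row.names = 1)))

## 16 samples: 4 conditions x 4 replicates; replicate number = batch
batch   <- factor(rep(1:4, times = 4))
factor1 <- factor(rep(c("minus", "plus"), each = 8))
factor2 <- factor(rep(rep(c("minus", "plus"), each = 4), times = 2))

keep <- rowSums(counts) >= 10          # illustrative total-count filter
y <- DGEList(counts = counts[keep, ])
y <- calcNormFactors(y)                # TMM normalization factors
design <- model.matrix(~ batch + factor1 + factor2)
```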
  (Intercept) batch2 batch3 batch4 Factor1 Factor2
1           1      0      0      0       0       0
...
# Estimating average dispersion across all genes with GLMs
y <- estimateGLMCommonDisp(y, design, verbose=TRUE)
# Estimation of genewise dispersions
y <- estimateGLMTrendedDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
# Fitting genewise glms
fit <- glmFit(y, design)
# Likelihood ratio tests, one per experimental factor
lrt.factor1 <- glmLRT(fit, coef=5)
lrt.factor2 <- glmLRT(fit, coef=6)
To get (2), I used a slightly cumbersome method. I went back to the raw counts, log2-transformed them, ran the removeBatchEffect() function from limma specifying the same design matrix (which includes the batch effect) as in (1), undid the log2 transform, and then ran a Trinity script that is supposed to take raw counts and spit out TMM-normalized FPKM values. I did it this way for conceptual simplicity, and because an existing script could handle the FPKM step.
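In code, the round trip looks roughly like this (a sketch, not my actual script: the pseudo-count of 1 and the output file name are my own choices, and I show removeBatchEffect() with the batch passed through its batch argument and only the factors to keep in design):

```r
library(limma)

logc <- log2(counts + 1)                           # pseudo-count to avoid log2(0)
design.keep <- model.matrix(~ factor1 + factor2)   # effects that should be kept
corrected <- removeBatchEffect(logc, batch = batch, design = design.keep)
counts.corrected <- 2^corrected - 1                # undo the log2 transform
write.table(counts.corrected, "batch_corrected_counts.txt",
            sep = "\t", quote = FALSE)
## ...this matrix then goes into the Trinity TMM/FPKM script
```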
I have a few questions:
- Should the design matrix used for limma's removeBatchEffect() not include the batch effect, and contain only the experimental factors you want to "keep", i.e. ~factor1+factor2? The manual says "The design matrix is used to describe comparisons between the samples, for example treatment effects, which should not be removed."
- Does (2) make sense apart from the design matrix issue, or is there a smarter way to get batch-corrected values from the edgeR procedure in (1)? If batch-corrected TMM-normalized counts can be obtained from it, is it possible to use cpm() and then divide by feature length to get TMM-normalized FPKM values? Does the order of batch correction and TMM normalization matter?
- Is it correct to use coef=5 if Factor1 is the column for the experimental factor I want to test (see the partial design matrix above)? Or did I inadvertently test for differential expression between batches?
- Is it correct to use coef instead of contrast? The bottom part of my design matrix output looks like this:
attr(,"contrasts")
attr(,"contrasts")$batch
[1] "contr.treatment"
attr(,"contrasts")$factor1
[1] "contr.treatment"
attr(,"contrasts")$factor2
[1] "contr.treatment"
It looks like the correct comparison(s) are already built into the design matrix?
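For concreteness, the cpm()-and-divide idea from my second question would look like this (gene.lengths is a hypothetical vector of feature lengths in bp, matched to rownames(y)); edgeR's rpkm() does, as far as I can tell, essentially the same division internally:

```r
## y is the TMM-normalized DGEList from (1); cpm() uses its TMM factors
tmm.cpm  <- cpm(y)
tmm.fpkm <- tmm.cpm / (gene.lengths / 1000)   # divide each gene by its length in kb

## I believe this is equivalent (up to the prior count) to:
tmm.fpkm2 <- rpkm(y, gene.length = gene.lengths)
```

Note that these values would be TMM-normalized but not batch-corrected, which is the gap my question is about.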
Why are you rounding the counts after you read them in? That suggests that they're not integers to begin with, and that you should be using limma/voom rather than edgeR.
They are non-integers because RSEM handles multi-mapping reads by assigning a fraction of each read based on the proportion of uniquely mapping reads. I round them because not doing so produces an error message in a later step, and because the Trinity pipeline (which uses qCML instead of GLMs) does the same.
The edgeR user guide says that "edgeR is not designed to work with estimated expression levels, for example as might be output by Cufflinks. edgeR can work with expected counts as output by RSEM, but raw counts are still preferred". This suggests that it is acceptable to use edgeR with this data?
I'd personally prefer to stick with limma in cases like this. I know Gordon Smyth has recommended that in the past, though his recommendation may well have evolved since I last looked; if the edgeR manual explicitly gives its OK, then I wouldn't worry about the validity of the results.
After I get this to work, I will be exploring voom(). Thank you for your suggestions.
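For my own reference, a minimal voom-based version of the same tests might look like this (a sketch, assuming the same DGEList y with TMM factors and the same design matrix as above, so the coefficient numbering carries over):

```r
library(limma)

v <- voom(y, design)             # transform counts to log2-CPM with precision weights
vfit <- lmFit(v, design)         # fit the same linear model
vfit <- eBayes(vfit)             # moderated t-statistics
topTable(vfit, coef = 5)         # factor 1 (same coefficient as glmLRT above)
topTable(vfit, coef = 6)         # factor 2
```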