I have an RNA-seq dataset comprising four samples, each with corresponding control and treatment, resulting in a total of eight RNA-seq datasets: sample1_control, sample1_treatment, sample2_control, sample2_treatment, and so on. My objective is to identify differentially expressed genes (DEGs) using a pairwise approach in edgeR.
Essentially, I adhered to the guidelines provided in the edgeR manual, which can be found at https://bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf, specifically starting from page 45. The following is my code:
y = DGEList(counts = exprs, genes = rownames(exprs))
keep = rowSums(cpm(y) >=0.1) >= 4
y = y[keep, , keep.lib.sizes = FALSE]
y$samples$lib.size = colSums(y$counts)
y = calcNormFactors(y, method = "TMM", lib.size = y$samples$lib.size)
TMM = cpm(y, normalized.lib.sizes = TRUE, log = FALSE)
Tissue = factor(c("control", "treatment", "control", "treatment", "control", "treatment", "control", "treatment"))
Paired = factor(c("Sample1", "Sample1", "Sample2", "Sample2", "Sample3", "Sample3", "Sample4", "Sample4"))
design = model.matrix(~Tissue + Paired)
y = estimateDisp(y, design, robust = TRUE)
fit = glmFit(y, design)
lrt = glmLRT(fit)
o = order(lrt$table$PValue)
cpm(y)[o[1:5],]
I exported the expression profile of top 5 DEGs:
I observed that the first gene, ENSG00000225217, consistently exhibited up-regulation pattern in the control group across four samples. However, the behavior of the remaining four genes appeared irregular. For instance, the second gene, ENSG00000244682, demonstrated up-regulation in the control group in three out of four samples, showing inconsistency across all sample pairs. This leads me to question whether there is an error in my coding approach, or is it due to the unique characteristics?
Thanks!
Thanks,
After updating my formula to '~Paired + Tissue', the results appear to be in order. Nonetheless, I'm still puzzled. You mentioned that 'the testing in edgeR focuses on the last coefficient in the matrix'. Given this, when I change the formula to '~Paired + Tissue', does it continue to conduct a paired-wise test?
The test is forumulated in a different framework to a paired t-test: EdgeR fits linear models, rather than doing t-tests. The forumula you have used tests for whether Tissue accounts for a significant amount of variation after variation due to patient to patient differences have been accountded for. However the two turn out to be equivalent.
Consider the following data:
In a paired t-test we calculate the difference for each sample: The difference for sample 1 is 1, for sample 2 its 0.8 and for sample 3 its 1.2. We take the average and standard deviation of these and ask if its is significantly different form zero. Mean = 1, SE=0.111, df=2
In the linear model with try to find b1,b2,b3,b4 such that
Expression = b1 * Sample1 + b2 * Sample2 + b3*Sample3 + b4 * Treatment
When we isolate b4 to test, we effectively regress out the Sample effects, so that each sample has the same mean. Mean Sample1 = 2.5, Mean Sample2 = 1, Mean Sample 3 = 4.2
To estimate b4 we take the different of the mean of controls (-0.5, -0.4, -0.6)=-0.5 and the means of the treatment (0.5, 0.4, 0.6)=0.5, and ask if this mean difference of 1 is significantly different from 0.
Many thanks to you. Very excellent and clear explanation.