Question

Question about identification of DEGs in paired sample using edgeR

0

Entering edit mode

10 months ago

tujuchuanli ▴ 130

I have an RNA-seq dataset comprising four samples, each with corresponding control and treatment, resulting in a total of eight RNA-seq datasets: sample1_control, sample1_treatment, sample2_control, sample2_treatment, and so on. My objective is to identify differentially expressed genes (DEGs) using a pairwise approach in edgeR.

Essentially, I adhered to the guidelines provided in the edgeR manual, which can be found at https://bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf, specifically starting from page 45. The following is my code:

y = DGEList(counts = exprs, genes = rownames(exprs))
keep = rowSums(cpm(y) >=0.1) >= 4
y = y[keep, , keep.lib.sizes = FALSE]
y$samples$lib.size = colSums(y$counts)
y = calcNormFactors(y, method = "TMM", lib.size = y$samples$lib.size)
TMM = cpm(y, normalized.lib.sizes = TRUE, log = FALSE)
Tissue = factor(c("control", "treatment", "control", "treatment", "control", "treatment", "control", "treatment"))
Paired = factor(c("Sample1", "Sample1", "Sample2", "Sample2", "Sample3", "Sample3", "Sample4", "Sample4"))
design = model.matrix(~Tissue + Paired)
y = estimateDisp(y, design, robust = TRUE)
fit = glmFit(y, design)
lrt = glmLRT(fit)
o = order(lrt$table$PValue)
cpm(y)[o[1:5],]

I exported the expression profile of top 5 DEGs:

enter image description here

I observed that the first gene, ENSG00000225217, consistently exhibited up-regulation pattern in the control group across four samples. However, the behavior of the remaining four genes appeared irregular. For instance, the second gene, ENSG00000244682, demonstrated up-regulation in the control group in three out of four samples, showing inconsistency across all sample pairs. This leads me to question whether there is an error in my coding approach, or is it due to the unique characteristics?

Thanks!

edgeR DEG • 844 views

ADD COMMENT • link 10 months ago by tujuchuanli ▴ 130

score 2 · Answer 1 · 2024-01-08

2

Entering edit mode

10 months ago

i.sudbery 20k

The problem here is that edgeR is finding genes that differ between samples/replicates, rather than genes that differ between treatments. That is because here, edgeR is fitting a model with 5 coefficients - one for each of the samples, and one for the treatment.

I think, by default, the testing in edgeR is on the last coefficient in the matrix. Which here would probably be Sample4 effects. Those you've identified genes that are different in Sample4 to the other samples. Which look about right looking at your results.

There are several ways to specify that you want the treatment effects, but the easiest is probably just to list them last in your design formula.

Thus, instead of your formula being ~Tissue + Paired, use ~Paired + Tissue.

ADD COMMENT • link 10 months ago by i.sudbery 20k

0

Entering edit mode

Thanks,

After updating my formula to '~Paired + Tissue', the results appear to be in order. Nonetheless, I'm still puzzled. You mentioned that 'the testing in edgeR focuses on the last coefficient in the matrix'. Given this, when I change the formula to '~Paired + Tissue', does it continue to conduct a paired-wise test?

ADD REPLY • link 10 months ago by tujuchuanli ▴ 130

2

Entering edit mode

The test is forumulated in a different framework to a paired t-test: EdgeR fits linear models, rather than doing t-tests. The forumula you have used tests for whether Tissue accounts for a significant amount of variation after variation due to patient to patient differences have been accountded for. However the two turn out to be equivalent.

Consider the following data:

                       Expression
               0    1    2    3    4    5
               +----+----+----+----+----+
               |
    Sample 1:  +         x    o               x = Control
               |                              o = treated
    Sample 2:  +  x   o
               |
    Sample 3:  +                 x     o

In a paired t-test we calculate the difference for each sample: The difference for sample 1 is 1, for sample 2 its 0.8 and for sample 3 its 1.2. We take the average and standard deviation of these and ask if its is significantly different form zero. Mean = 1, SE=0.111, df=2

In the linear model with try to find b1,b2,b3,b4 such that Expression = b1 * Sample1 + b2 * Sample2 + b3*Sample3 + b4 * Treatment

When we isolate b4 to test, we effectively regress out the Sample effects, so that each sample has the same mean. Mean Sample1 = 2.5, Mean Sample2 = 1, Mean Sample 3 = 4.2

                      Residual
              -2   -1    0    1    2    
               +----+----+----+----+
                         |
    Sample 1:         x  +  o         x = Control
                         |            o = treated
    Sample 2:          x + o
                         |
    Sample 3:        x   +   o

To estimate b4 we take the different of the mean of controls (-0.5, -0.4, -0.6)=-0.5 and the means of the treatment (0.5, 0.4, 0.6)=0.5, and ask if this mean difference of 1 is significantly different from 0.