Create the correct contrast in limma when want to combine 2 or more groups
2
0
Entering edit mode
4 days ago
Lila M ★ 1.3k

Hi all,

I am running DEA using limma. I have 4 groups (for simplicity A, B, C and healthy (H)). Where my metadata is 'metdata' and I have a column Group for that information and my expression matrix is DA. The idea is to run DEA by comparing: A vs H, B+C vs H and A vs B+C.

Now my question is, should I combine B+C in me model as follow:

design <- model.matrix( ~0 + Groups + Age + Gender, metadata)
abundancesWeighted <- voom(DA, design)
model <- lmFit(abundancesWeighted, design)
model <- contrasts.fit(model, makeContrasts(contrasts = '(GroupB+GroupC)/2-GroupH' levels = design))
model <- eBayes(model)
results <- topTable(model, coef ='(GroupB+GroupC)/2-GroupH' , number = Inf)

Or is better if I modify the metadata by create a new group: GroupMerged in which I only have A, B_C and H and then I run:

design <- model.matrix( ~0 + Groups + Age + Gender, metadata)
abundancesWeighted <- voom(DA, design)
model <- lmFit(abundancesWeighted, design)
model <- contrasts.fit(model, makeContrasts(contrasts = 'GroupB_C-GroupH' levels = design))
model <- eBayes(model)
results <- topTable(model, coef ='GroupB_C-GroupH' , number = Inf)

It seems both approaches are correct, but I want to know if anyone has an stronger opinion about this

Thank you!

contrast limma DEA • 366 views
ADD COMMENT
3
Entering edit mode
3 days ago
Gordon Smyth ★ 7.7k

The general principle is that you should not give false information to limma. If you have determined that groups B and C are not genuinely different, with no truly DE genes between the two groups, then of course you can combine them and treat them as one group. In that case, both approaches would be statistically correct but the second approach would be more statistically powerful than the first approach, because it would remove one redundant coefficient from the linear model fit. If B and C have very unequal sample sizes, then the second approach will be a lot more powerful.

If B and C really are different groups, and there are genes that are DE between them, then the first approach is correct and the second is not. You would be telling limma that there are three groups when there really are four, and there are always consequences for giving false information. The second approach will give larger residual variances and less power because the systematic B vs C differences are included as residual variation and will tend to inflate the variance estimates.

If the sample sizes for B and C are different, then the two appraoches are testing different hypotheses. The second approach will weight B and C according to their sample sizes when forming comparisons whereas the first approach will weight B and C equally (because you specified (B+C)/2 in the contrast). So you have to ask yourself, why are you wanting to use (B+C)/2 in your contrasts? What is your scientific question? Do you genuinely want the average effect of B and C, or do you actually want to weight B and C according to their prevalence in some population? Would you still want to give equal weight to C (say) even if it has a very small sample size?

If your sample sizes are all large and B and C are biologically nearly the same, but not exactly, then it becomes a matter of judgement. It depends on your scientific aims rather than being a programming issue.

ADD COMMENT
0
Entering edit mode

This is a great answer and exactly what I needed :) thank you very much

ADD REPLY
0
Entering edit mode
3 days ago
LChart 4.7k

[Edit] This below answer is wrong but retained for posterity

For a simple linear model as above, the two specifications are not just "both correct:" they are equivalent and should produce the same results, up to numerical optimization differences. In practice I have always preferred the 2nd approach as it makes it trivial to get statistics like the combined group size, combined group percentiles (etc).

ADD COMMENT
1
Entering edit mode

Thanks for answering a limma question, but the two models aren't actually equivalent. The second model has one less model coefficient and one more residual degree of freedom. If there is DE between B and C, or unequal sample sizes, then the two models could give very different results.

ADD REPLY

Login before adding your answer.

Traffic: 1326 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6