What Is The Math Proof That A The Beta Of A Linear Regression Is Equal To The Log2 Fold Change In Microarrays ?
2
5
Entering edit mode
11.2 years ago

Hi,

I have found that some programs use the beta of a linear regression ( gene_exp ~ pheno ) to provide the log2 fold change (e.g. function topTable from limma R package) by looking into the scripts and testing, but I did not found any documentation related to this. So maybe I am right maybe not.

Would it be possible to have some references about this or somebody show me the math proof of that ?

microarray • 9.6k views
ADD COMMENT
5
Entering edit mode
11.2 years ago
David W 4.9k

This is only true if the response was log2-transformed prior to running the model.

The easiest way to think through is probably a toy example.

 set.seed(123)
 x <- c(rpois(50,100), rpois(50,75))
 y <- rep(factor(c("A", "B")), each=50)

That's 50 observations for each of 2 phenotypes, with the "true" fold difference between B and A being 75/100 = 3/4 ~ 2^-0.415.

If you fit a linear model the betas are going to tell you the mean of each group:

(betas <- lm(x~y)$coefficients)
# (Intercept)          yB 
#     100.06      -23.68 
(means <- tapply(x,y,mean))
#    A      B 
#100.06  76.38 
betas[1] + betas[2]  == means[2]
# TRUE

Obviously, the betas are not he same as the log2 fold-change. To get that you can either transform the ratio of the estimates

log(sum(betas)/betas[1], 2) 
# -0.3895985

Or perform the transformation before the model-fitting

 lm(log(x,2)~y)$coefficients
 # (Intercept)     yB 
 # 6.639029   -0.389812

EDIT

I forget to explain the math-sy reason why this works, which might not be immediately obvious. As we've seen, when we do a linear regression with a categorical predictor, the Beta values reflect difference in the mean value between groups. If we first log-transform the response values then, of course, we'll end up with a difference of logs. In the example that's log(100) - log(75) which, thanks to the magic of logs, is the same thing as log(100/75): the log fold-difference.

EDIT 2

To clarify about the small difference between taking the log of the ratio of betas, rather that first log-transforming the values. This arises because the log-transform also changes the shape of the distrbution and therefore the mean value. With the toy example, the means of the log transformed x (tapply(log(x,2), y, mean)) is slightly different than log-transform of the means of x (log(tapply(x,y,mean),2))

ADD COMMENT
1
Entering edit mode

Thanks for your answer. So If I understood, you have 2 ways of calculating log2 fold change with betas:

1) In your exemple, the fold change is FC= mean( x( B ) ) / mean( x( A ) ). Though FC = 76.38/100.06 = 0.763342. And log2( FC ) = -0.3895985 = log(sum(betas)/betas[1], 2) .

2) If you log2 transform x before model fitting which is the case of gene expression usually after normalisation and summarization:

log2x <- log2( x ), mean( log2x( B ) - log2x( A ) ) = -0.389812 = yB.

So in that case you have the log2( FC ) = beta.

How do you explain the small difference between -0.3895985 and -0.389812 ? Are they caused by some approximation ?

ADD REPLY

Login before adding your answer.

Traffic: 1992 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6