That's what I understand and how I would do it. First let's create some simulated data of protein level for 4 populations, each population with different mean (i.e. there is stratification that we want to remove):
N<- 100
a<- rnorm(n= N, mean= 1, sd= 2)
b<- rnorm(n= N, mean= 2, sd= 2)
c<- rnorm(n= N, mean= 3, sd= 2)
d<- rnorm(n= N, mean= 4, sd= 2)
dat<- data.frame(prot_lev= c(a, b, c, d), pop= rep(c('a', 'b', 'c', 'd'), each= N))
boxplot(prot_lev ~ pop, data= dat)
To adjust for *population stratification, a linear regression of
protein level on population label was performed
So let's model protein level as a function of the population and get the residuals from the model:
lmreg<- lm(prot_lev ~ pop, data= dat)
dat$prot_lev_res<- lmreg$residuals
boxplot(prot_lev_res ~ pop, data= dat)
mean(dat$prot_lev_res) ## ~ zero as it should be
sd(dat$prot_lev_res) ## ~ 2 as per simulation
the residuals were normalized by transforming the quantiles of the
residual values to their respective quantiles of a N(0,1) distribution
Data has already mean equal to zero, we need to make the stdev equal 1:
dat$prot_lev_qnorm<- lmreg$residuals/sd(lmreg$residuals)
mean(dat$prot_lev_qnorm) ## ~ 0
sd(dat$prot_lev_qnorm) ## ~ 1
boxplot(prot_lev_qnorm ~ pop, data= dat)
Thanks for you reply.Two questions for you : (1)So you mean the author used the residuals from the linear model to represent the protein levels and the residuals were protein levels after removing the population effects ? (2)As you can see in the following code:
And it goes like this:
you can see the (Intercept) , pop and the model significance are greater than 0.05, so is it reasonable that i use the residuals in this model?