linear regression for removing the population effects
3
2
Entering edit mode
8.5 years ago
LJ ▴ 280

"*To adjust for *population stratification, a linear regression of protein level on population label was performed and the residuals were normalized by transforming the quantiles of the residual values to their respective quantiles of a N(0,1) distribution**."what does this sentence mean? Is it a way to remove the population effect? So if it is ,how can i do this in R?

R linear regression residuals normalization • 3.9k views
ADD COMMENT
4
Entering edit mode
8.5 years ago

That's what I understand and how I would do it. First let's create some simulated data of protein level for 4 populations, each population with different mean (i.e. there is stratification that we want to remove):

N<- 100
a<- rnorm(n= N, mean= 1, sd= 2)
b<- rnorm(n= N, mean= 2, sd= 2)
c<- rnorm(n= N, mean= 3, sd= 2)
d<- rnorm(n= N, mean= 4, sd= 2)
dat<- data.frame(prot_lev= c(a, b, c, d), pop= rep(c('a', 'b', 'c', 'd'), each= N))
boxplot(prot_lev ~ pop, data= dat)

To adjust for *population stratification, a linear regression of protein level on population label was performed

So let's model protein level as a function of the population and get the residuals from the model:

lmreg<- lm(prot_lev ~ pop, data= dat)
dat$prot_lev_res<- lmreg$residuals

boxplot(prot_lev_res ~ pop, data= dat)

mean(dat$prot_lev_res) ## ~ zero as it should be 
sd(dat$prot_lev_res)   ## ~ 2 as per simulation

the residuals were normalized by transforming the quantiles of the residual values to their respective quantiles of a N(0,1) distribution

Data has already mean equal to zero, we need to make the stdev equal 1:

dat$prot_lev_qnorm<- lmreg$residuals/sd(lmreg$residuals)

mean(dat$prot_lev_qnorm) ## ~ 0
sd(dat$prot_lev_qnorm)  ## ~ 1

boxplot(prot_lev_qnorm ~ pop, data= dat)
ADD COMMENT
0
Entering edit mode

Thanks for you reply.Two questions for you : (1)So you mean the author used the residuals from the linear model to represent the protein levels and the residuals were protein levels after removing the population effects ? (2)As you can see in the following code:

a<-rnorm(100)
b<-rnorm(50)
c<-rnorm(30)
dat<-data.frame(prot_lev= c(a, b, c), pop=c(rep(0,100),rep(1,50),rep(2,30)))
lmreg<- lm(prot_lev ~ pop, data= dat)
summary(lmreg)

And it goes like this:

Call: lm(formula = prot_lev ~ pop, data = dat)

Residuals: Min 1Q Median 3Q Max -2.24687 -0.66452 0.02218 0.76330 2.24357

Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.07544 0.09237 0.817 0.4152
pop -0.16256 0.09505 -1.710 0.0889 .

--- Signif. codes: 0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9636 on 178 degrees of freedom

Multiple R-squared: 0.01617, Adjusted R-squared: 0.01064

F-statistic: 2.925 on 1 and 178 DF, p-value: 0.08894

you can see the (Intercept) , pop and the model significance are greater than 0.05, so is it reasonable that i use the residuals in this model?

ADD REPLY
1
Entering edit mode
8.5 years ago
Shab86 ▴ 310

Population stratification means any differences you might find in the allele frequencies within a population due to different ancestry. This could be due to non-random mating or admixture of populations in the past. This can be a problem for GWAS, where association is due to the underlying population structure and not a disease-associated locus.

Some papers to help you out:

  1. http://genepath.med.harvard.edu/~reich/Reich%20and%20Goldstein.pdf (One of the earlier works)
  2. http://www.hindawi.com/journals/ijg/2015/501617/ (example of how pop strat affects GWAS)
  3. http://www.nature.com/nrg/journal/v11/n7/full/nrg2813.html
ADD COMMENT
1
Entering edit mode

Thanks for your reply. But i'm not doing GWAS,what i need is: i have a 3000 genes expression data in 100 samples,and i konw the samples population.So how do i remove the population effects to normalize the expression data using the linear regression in R? What i need finally is the 3000 expression data in these samples after population effects removal,and i plan to use the data to do genes co-varying.So how do i normalize the data in R just like the sentence said?

ADD REPLY
1
Entering edit mode

Maybe you can try and input the population as a categorical variables in the analysis. Not sure if you are working on RNA Seq but DESeq2 got a very nice tutorial on how to perform such analysis. You can find it here

ADD REPLY
1
Entering edit mode
8.5 years ago
LJ ▴ 280

nobody has answer???

ADD COMMENT
1
Entering edit mode

You got one answer, but it was not what you asked. Maybe that means that the question was not clearly phrased?

ADD REPLY

Login before adding your answer.

Traffic: 1533 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6