Question

linear regression for removing the population effects

8.4 years ago

LJ ▴ 280

"To adjust for population stratification, a linear regression of protein level on population label was performed and the residuals were normalized by transforming the quantiles of the residual values to their respective quantiles of a N(0,1) distribution." What does this sentence mean? Is it a way to remove the population effect? If so, how can I do this in R?

R linear regression residuals normalization • 3.9k views

updated 8.4 years ago by dariober 15k • written 8.4 years ago by LJ ▴ 280

score 4 · Answer 1 · 2016-06-03

This is how I understand it and how I would do it. First, let's create some simulated protein-level data for 4 populations, each with a different mean (i.e. there is stratification that we want to remove):

N<- 100
a<- rnorm(n= N, mean= 1, sd= 2)
b<- rnorm(n= N, mean= 2, sd= 2)
c<- rnorm(n= N, mean= 3, sd= 2)
d<- rnorm(n= N, mean= 4, sd= 2)
dat<- data.frame(prot_lev= c(a, b, c, d), pop= rep(c('a', 'b', 'c', 'd'), each= N))
boxplot(prot_lev ~ pop, data= dat)

"To adjust for population stratification, a linear regression of protein level on population label was performed"

So let's model protein level as a function of the population and get the residuals from the model:

lmreg<- lm(prot_lev ~ pop, data= dat)
dat$prot_lev_res<- lmreg$residuals

boxplot(prot_lev_res ~ pop, data= dat)

mean(dat$prot_lev_res) ## ~ zero as it should be 
sd(dat$prot_lev_res)   ## ~ 2 as per simulation

"the residuals were normalized by transforming the quantiles of the residual values to their respective quantiles of a N(0,1) distribution"

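The data already has mean zero, so we need to make the standard deviation equal to 1. (Dividing by the standard deviation rescales the residuals without changing the shape of their distribution, which is adequate here because the simulated residuals are already approximately normal.)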

dat$prot_lev_qnorm<- lmreg$residuals/sd(lmreg$residuals)

mean(dat$prot_lev_qnorm) ## ~ 0
sd(dat$prot_lev_qnorm)  ## ~ 1

boxplot(prot_lev_qnorm ~ pop, data= dat)
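Read literally, the quoted sentence describes a rank-based mapping of the residual quantiles onto N(0,1) quantiles, which also works when the residuals are skewed or heavy-tailed. A minimal sketch of that reading follows. The `(rank - 0.5) / n` offset is an assumption (one common convention; the quoted sentence does not say which offset was used), and the fixed seed and the column name `prot_lev_int` are only there for illustration:

```r
# Self-contained sketch: regress out population, then map residual
# quantiles to the matching quantiles of a N(0,1) distribution.
set.seed(1)  # assumption: fixed seed, only for reproducibility
N <- 100
dat <- data.frame(
  prot_lev = c(rnorm(N, 1, 2), rnorm(N, 2, 2), rnorm(N, 3, 2), rnorm(N, 4, 2)),
  pop = rep(c("a", "b", "c", "d"), each = N)
)

# Residuals after removing the per-population mean
res <- residuals(lm(prot_lev ~ pop, data = dat))

# Rank-based inverse normal transformation: each rank becomes a probability
# strictly inside (0, 1), and qnorm() returns the N(0,1) quantile for it.
n <- length(res)
dat$prot_lev_int <- qnorm((rank(res, ties.method = "average") - 0.5) / n)

mean(dat$prot_lev_int)  ## ~ 0
sd(dat$prot_lev_int)    ## close to 1
```

The transformation keeps the ordering of the residuals and only changes their spacing, so any downstream rank-based statistic is unaffected.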

score 1 · Answer 2 · 2016-06-01

8.4 years ago

Shab86 ▴ 310

Population stratification refers to systematic differences in allele frequencies between subgroups of a population that arise from different ancestry. This could be due to non-random mating or admixture of populations in the past. It can be a problem for GWAS, where an apparent association may reflect the underlying population structure rather than a disease-associated locus.

Some papers to help you out:

http://genepath.med.harvard.edu/~reich/Reich%20and%20Goldstein.pdf (One of the earlier works)
http://www.hindawi.com/journals/ijg/2015/501617/ (example of how pop strat affects GWAS)
http://www.nature.com/nrg/journal/v11/n7/full/nrg2813.html

Thanks for your reply. But I'm not doing GWAS. What I have is expression data for 3000 genes in 100 samples, and I know the population of each sample. How do I remove the population effects to normalize the expression data using linear regression in R? What I need in the end is the expression data for the 3000 genes in these samples after removing the population effects, and I plan to use it for a gene co-variation analysis. So how do I normalize the data in R as the sentence describes?

reply 8.4 years ago by LJ ▴ 280

Maybe you can try including the population as a categorical variable in the analysis. I'm not sure whether you are working with RNA-Seq, but DESeq2 has a very nice tutorial on how to perform such an analysis. You can find it here
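For a plain expression matrix, the same idea can be sketched in base R by regressing every gene on the population factor at once. This is only an illustration under stated assumptions: the matrix `expr` (genes in rows, samples in columns), the labels in `pop`, and the helper `int` are hypothetical and simulated here, and the rank offset `(rank - 0.5) / n` is one common convention rather than anything specified in the thread:

```r
# Sketch: remove a population effect from a genes x samples matrix.
set.seed(1)  # assumption: fixed seed, only for reproducibility
n_genes <- 3000
n_samples <- 100
pop <- factor(rep(c("a", "b", "c", "d"), each = n_samples / 4))  # hypothetical labels

# Simulated expression: genes in rows, samples in columns (assumed layout)
expr <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)),
                               paste0("s", seq_len(n_samples))))
expr <- sweep(expr, 2, as.numeric(pop), "+")  # add a per-population shift

# lm() accepts a matrix response (one column per gene), so transpose
# to samples x genes, fit once, and transpose the residuals back.
fit <- lm(t(expr) ~ pop)
res <- t(residuals(fit))  # genes x samples, population means removed

# Hypothetical helper: rank-based inverse normal transform for one gene
int <- function(x) qnorm((rank(x) - 0.5) / length(x))
expr_adj <- t(apply(res, 1, int))
dimnames(expr_adj) <- dimnames(expr)

dim(expr_adj)  # 3000 genes x 100 samples
```

Whether to stop at `res` or continue to `expr_adj` depends on the downstream analysis; the rank-based step forces every gene onto the same N(0,1) scale, which may or may not be desirable for co-variation analyses.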

reply 8.4 years ago by Sam ★ 4.8k

score 1 · Answer 3 · 2016-06-02

8.4 years ago

LJ ▴ 280

Does nobody have an answer?

You got one answer, but it was not what you asked. Maybe that means that the question was not clearly phrased?

reply 8.4 years ago by Benn 8.3k