Question

PRS regression and covariates

4.6 years ago

Silvia ▴ 20

Hi all,

I have a quick question. In a seminar talk by Cathryn Lewis (Professor of Genetic Epidemiology and Statistics at King's College London), she said that when you run a regression with a polygenic risk score (PRS) as the independent variable (IV) predicting your phenotype of interest (the dependent variable, DV), you can add covariates to control for. These are normally principal components, genotyping batch, or anything else that is correlated with your PRS, as opposed to covariates such as age and IQ, which are correlated with the phenotype (i.e., the DV) instead.

I was looking for a paper to check that I understood this properly and to cite instead of the seminar talk, but I couldn't find one. So I was wondering if you could confirm that it is common practice to correct only for covariates associated with your genetic IV, and/or whether you have a paper to recommend.

Thanks so much, Silvia

PRS regression covariate • 4.4k views

Updated 2.3 years ago by Sam ★ 4.8k • written 4.6 years ago by Silvia ▴ 20

From what I understood, these (i.e., the PC covariates, age, BMI, etc) are more typically included when deriving the polygenic risk scores (PRS) themselves; so, the model would be:

glm(outcome ~ SNP + PC1 + PC2 + I(Age^2) + BMI)

The PRS is then typically constructed from the beta coefficient for the SNP from this model. I see no further need to adjust for covariates when using these scores elsewhere, due to the fact that the covariates are already 'absorbed' into the scores.

Examples:

It all depends on how the PRS was calculated in the first place. Remember that 'risk score' is a general term with no clear definition. If you ask two Professors of Statistics 'What are risk scores?', they will give different answers.

Reply • 4.3 years ago by Kevin Blighe 88k

score 0 · Answer 1 · 2020-04-14

4.6 years ago

Sam ★ 4.8k

PRS is calculated as

sum over SNPs of (effect size * number of effect alleles carried by the subject)

Usually, we perform some form of optimization to decide which SNPs to include and how much to shrink the effect sizes (you can read more about thresholding and shrinkage here). To select an "optimal" parameter, we usually regress the phenotype of interest on the calculated PRS and select the parameter that generates the PRS most strongly associated with the phenotype.
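As a minimal sketch of the sum described above, the snippet below computes a score for one subject and applies a p-value cutoff as one simple form of thresholding. The SNP IDs, effect sizes, p-values, and the `polygenic_score` helper are all made up for illustration; real tools also handle allele matching, missing genotypes, and shrinkage, none of which is shown here.

```python
def polygenic_score(weights, dosages, p_threshold=1.0):
    """Sum of (effect size * effect-allele count) over SNPs passing the cutoff.

    weights: hypothetical {snp_id: (effect_size, gwas_p_value)} mapping
    dosages: {snp_id: number of effect alleles (0, 1 or 2) in the subject}
    """
    score = 0.0
    for snp, (beta, p_value) in weights.items():
        if p_value <= p_threshold:
            score += beta * dosages[snp]
    return score


# Made-up summary statistics and one subject's genotypes.
weights = {"rs1": (0.10, 1e-8), "rs2": (-0.05, 0.03), "rs3": (0.20, 0.40)}
subject = {"rs1": 2, "rs2": 1, "rs3": 1}

score_all = polygenic_score(weights, subject)           # uses all three SNPs
score_strict = polygenic_score(weights, subject, 0.05)  # drops rs3 (p = 0.40)
print(score_all, score_strict)
```

Repeating this over a grid of cutoffs and regressing the phenotype on each resulting score would correspond to the parameter selection step described above.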

If you imagine the PRS as a poor proxy of the genetic liability of an individual, then the aforementioned regression can be roughly represented as

Y ~ G + E

where Y is the phenotype and G is the genetic proxy. This is almost identical to the GWAS equation, which has the SNP dosage as G. As in a GWAS analysis, confounders such as population stratification can bias this regression and should therefore be adjusted for.
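To illustrate why the adjustment matters, here is a small simulation sketch in plain Python. It assumes a single principal component `pc` that shifts both the PRS and the phenotype, while the PRS itself has no effect on the phenotype. The effect sizes and helper functions are invented for illustration, and a real analysis would use a regression library rather than these closed-form slopes.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def slope_unadjusted(x, y):
    """Coefficient of x in the regression y ~ x."""
    return cov(x, y) / cov(x, x)


def slope_adjusted(x, z, y):
    """Coefficient of x in y ~ x + z (closed form for two predictors)."""
    sxx, szz, sxz = cov(x, x), cov(z, z), cov(x, z)
    return (cov(x, y) * szz - cov(z, y) * sxz) / (sxx * szz - sxz ** 2)


random.seed(0)
n = 5000
pc = [random.gauss(0, 1) for _ in range(n)]   # ancestry axis (the confounder)
prs = [c + random.gauss(0, 1) for c in pc]    # PRS differs by ancestry
y = [c + random.gauss(0, 1) for c in pc]      # phenotype differs by ancestry; no PRS effect

naive = slope_unadjusted(prs, y)        # Y ~ PRS
adjusted = slope_adjusted(prs, pc, y)   # Y ~ PRS + PC
print(f"unadjusted: {naive:.3f}, adjusted for PC: {adjusted:.3f}")
```

Under these assumptions the unadjusted slope should come out clearly non-zero even though the simulated PRS has no effect, while including the PC should pull it back towards zero.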

Thanks for chipping in, Sam.

Reply • 4.6 years ago by Kevin Blighe 88k

Thank you both! I have corrected for population stratification and batch effects, as these might bias the PRS as you said, so now I am confident I did it right. However, I am still unsure whether I need to control for variables associated with my phenotype (e.g., IQ). On the one hand, the seminar talk said to control only for variables associated with your genotype (i.e., PCs, batch effects, etc.) and not your phenotype (e.g., IQ); on the other hand, the methods sections of relevant papers are inconsistent. Some control for covariates associated with the phenotype under investigation and others don't.

Reply • 4.6 years ago by Silvia ▴ 20

It is similar to GWAS. If your trait is binary, then adding too many covariates can reduce your power (there's a paper on it, but I don't remember the name). However, in most other cases, you usually want to adjust for the relevant covariates; e.g., for height, you would always adjust for age and sex, even though age might not necessarily be associated with your genotype in a way that is meaningful for height. Similarly, for IQ, people usually want to adjust for socioeconomic status and educational attainment, as those can have an impact on IQ.

Reply • 4.6 years ago by Sam ★ 4.8k

Hi Silvia and Sam, I am facing a similar quandary with my hypothesis. If I understood Silvia correctly, the question was whether or not to include all relevant covariates (both the ones that affect the PRS and the ones that are relevant to the primary outcome / dependent variable) in the same model, correct?

Y ~ PRS + PCs + Genotyping Batch + age + sex + etc. for height as Y.

versus:

Y1 = residuals of (Y ~ age + sex + etc., for height), then Y1 ~ PRS + PCs + Genotyping Batch

Could you please let me know what you landed on?

Reply • 4.3 years ago by vkpilla • 0

You can do both. For the second option, though, we usually regress out all of the covariates in the first step instead of leaving some for the final model. The problem with the second approach is that the R2 and p-value will be slightly off if there is correlation between the independent variable of interest (the PRS) and the covariates, but the result should usually be close enough.
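A rough sketch of how the two options can differ, under invented assumptions: a single covariate `age` that is deliberately correlated with the PRS, made-up effect sizes, and hand-written helper functions in place of a regression library. It compares the full model, the two-stage version that residualizes only Y, and a variant that residualizes both Y and the PRS on the covariate.

```python
import random


def mean(v):
    return sum(v) / len(v)


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def slope(x, y):
    """Coefficient of x in the regression y ~ x."""
    return cov(x, y) / cov(x, x)


def residuals(x, y):
    """Residuals of the regression y ~ x (with intercept)."""
    b = slope(x, y)
    a = mean(y) - b * mean(x)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


def slope_adjusted(x, z, y):
    """Coefficient of x in y ~ x + z (closed form for two predictors)."""
    sxx, szz, sxz = cov(x, x), cov(z, z), cov(x, z)
    return (cov(x, y) * szz - cov(z, y) * sxz) / (sxx * szz - sxz ** 2)


random.seed(1)
n = 5000
age = [random.gauss(0, 1) for _ in range(n)]
prs = [0.5 * a + random.gauss(0, 1) for a in age]   # PRS correlated with the covariate
y = [0.3 * p + 0.8 * a + random.gauss(0, 1) for p, a in zip(prs, age)]

beta_full = slope_adjusted(prs, age, y)   # option 1: Y ~ PRS + age
y_res = residuals(age, y)
beta_two_stage = slope(prs, y_res)        # option 2: residuals(Y ~ age) ~ PRS
prs_res = residuals(age, prs)
beta_both = slope(prs_res, y_res)         # residualize Y and PRS on age
print(beta_full, beta_two_stage, beta_both)
```

In this setup the two-stage coefficient is shrunk relative to the full model because the PRS shares variance with the covariate, whereas residualizing both sides reproduces the full-model coefficient. With an uncorrelated covariate the first two estimates would be nearly identical.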

Reply • 4.3 years ago by Sam ★ 4.8k

Thanks, the above is really helpful. I also have a related question - would the below be ok?

Y ~ (residuals of PRS ~ PCs + Genotyping Batch) + age + sex + education etc. for cognitive performance/IQ.

Reply • 2.3 years ago by Elizabeth • 0

Why do you want to do that? Wouldn't it be easier to just include the PCs and genotyping batch as covariates in the full equation? If you are concerned about population structure, one possible approach is to standardize the PRS with some form of PC projection (from the Broad Institute; I might have the code, but will need to find it). Otherwise, this just makes the results a lot more difficult to interpret.

Reply • 2.3 years ago by Sam ★ 4.8k