Question

Should the PRS distribution of target data follow a Gaussian distribution?

2

5 months ago

bubgoose ▴ 20

Hello,

I'm new to GWAS and came across this question while reading this tutorial paper.

PRS distribution

The central limit theorem dictates that if a PRS is based on a sum of independent variables (here, SNPs) with identical distributions, then the PRS of a sample should approximate the normal (Gaussian) distribution. This is true even if the PRS has extremely low predictive accuracy, since the sum of random numbers is approximately normally distributed, and so a normally distributed PRS in a sample should not be considered as validation of the accuracy of a PRS or of the liability threshold model. However, strong violations of these assumptions, such as the use of many correlated SNPs or a sample of heterogenous ancestry (thus, SNPs with markedly different genotype distributions), can lead to non-normal PRS distributions. Thus, inspection of PRS distributions may highlight calculation errors or problems of population stratification in the target sample for which researchers did not adequately control.

It says that the PRS distribution usually follows a Gaussian distribution, but I wonder why. If the target data consists of two phenotype groups that can be cleanly distinguished by the PRS, I would expect the PRS distribution in the target data to look like a mixture of two Gaussian distributions.

Could someone please explain if I have this wrong?

Thank you.

Polygenic-risk-score PRS GWAS • 721 views

ADD COMMENT • link updated 5 months ago by Ram 44k • written 5 months ago by bubgoose ▴ 20

1

I'm not following your question. Just because you have two phenotype groups in mind (e.g. professional basketball players vs. others) doesn't mean the PRS for height isn't normally distributed for people in general.

ADD REPLY • link 5 months ago by Jeremy Leipzig 22k

0

For the real-world case, you are right. But the problem is, if I have target data containing 1000 professional basketball players and 1000 others, the PRS (for height) distribution must look bimodal (if the PRS can clearly distinguish the phenotypes).

ADD REPLY • link 5 months ago by bubgoose ▴ 20

score 4 · Accepted Answer · 2024-06-20

This has to do with the assumptions behind a PRS.

The assumptions of a PRS are that:

1. The phenotype is a linear combination of small effects from many SNPs.
2. The genotype at each SNP is independent of that at other SNPs.

Under these assumptions, the PRS can be shown mathematically to approximate a normal distribution.
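To make the central limit argument concrete, here is a minimal simulation sketch. It is a toy model, not a real PRS pipeline: the allele frequencies, effect sizes, sample size, and the helper names `simulate_prs` and `skew_and_excess_kurtosis` are all assumptions made up for illustration. It sums independent SNP effects and checks that the skewness and excess kurtosis of the resulting scores are both close to 0, as expected for an approximately normal distribution.

```python
import numpy as np

def simulate_prs(n_people, n_snps, rng):
    """Return a toy PRS: a weighted sum of independent SNP allele counts."""
    # Hypothetical allele frequencies and effect sizes, drawn at random.
    freqs = rng.uniform(0.05, 0.5, size=n_snps)
    betas = rng.normal(0.0, 1.0, size=n_snps)
    # Each genotype is an allele count (0, 1 or 2), independent across SNPs.
    genotypes = rng.binomial(2, freqs, size=(n_people, n_snps))
    return genotypes @ betas

def skew_and_excess_kurtosis(x):
    """Sample skewness and excess kurtosis; both are 0 for a normal distribution."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z**3)), float(np.mean(z**4) - 3.0)

rng = np.random.default_rng(0)
prs = simulate_prs(n_people=5000, n_snps=1000, rng=rng)
skew, kurt = skew_and_excess_kurtosis(prs)
print(f"skewness = {skew:.3f}, excess kurtosis = {kurt:.3f}")
```

Note that the effect sizes here are pure random numbers with no link to any phenotype, which matches the point in the quoted paper that a normal-looking PRS says nothing about its predictive accuracy.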

So what is happening in a situation where a phenotype is clearly not normally distributed? There are three possibilities:

1. The phenotype is being affected by environmental effects that are not normally distributed.
2. The effects of the SNPs are not additive.
3. The SNPs are not independent of each other in the population.

In case 1, the PRS is still predicting genetic risk (which would still be normally distributed); it's just that the phenotype is a combination of genotype and environment. However, in cases 2 and 3, the assumptions behind the PRS are violated, and the PRS will not work accurately.

What is the most likely explanation in the example given in the comments (1000 professional basketball players vs. 1000 others)? While both 1 and 2 might apply, we know that height PRSs are normal within a general population, so the biggest effect is likely to be 3: that the genotypes at each SNP are not independent. You can look at this in several ways. Firstly, you can argue that your population is not homogeneous (PRSs only work on homogeneous populations). It's possible that your professional players have a different ancestry from those who are not players (in fact they almost certainly do: each person has their own ancestry, and if height is genetic then a tall person must have tall ancestry, but perhaps you are selecting average ancestries that are different). But this is not necessary. Even within an otherwise homogeneous population, by selecting 1000 tall people to put in the player category you are inducing correlation between height-related SNPs and making your population non-homogeneous (or, to put it another way, your sample is not representative of the population it is drawn from).

All these arguments are exactly the reason that PRSs only work in homogeneous populations (which almost never exist in reality), and even then, only in a population similar to the one they were derived in.
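The basketball example can be sketched with a small simulation. This is only an illustration under made-up assumptions: the population size, the share of phenotype variance explained by the PRS (`h2 = 0.8`, far higher than real height PRSs achieve), and the selection of the 1000 tallest people as "players" are all hypothetical. The PRS is normal in the whole population, but the pooled sample of 1000 selected and 1000 unselected people is a mixture of two groups with clearly different means.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical settings: population size and fraction of phenotype
# variance explained by the PRS (chosen large to make the effect visible).
n_pop, h2 = 200_000, 0.8

# In the full population the PRS is normal, and the phenotype is
# the PRS plus independent environmental noise.
prs = rng.normal(0.0, np.sqrt(h2), n_pop)
phenotype = prs + rng.normal(0.0, np.sqrt(1.0 - h2), n_pop)

# "Players": the 1000 tallest people. "Others": 1000 drawn from the rest.
order = np.argsort(phenotype)
players = prs[order[-1000:]]
others = rng.choice(prs[order[:-1000]], size=1000, replace=False)
pooled = np.concatenate([players, others])

# Distance between the group means, in units of the "others" SD.
gap_in_sd = (players.mean() - others.mean()) / others.std()
print(f"mean PRS gap between groups: {gap_in_sd:.2f} SD")

# A crude text histogram of the pooled sample.
counts, edges = np.histogram(pooled, bins=20)
for count, left in zip(counts, edges[:-1]):
    print(f"{left:6.2f} | {'#' * (count // 10)}")
```

The non-normality here comes entirely from how the sample was assembled, not from the PRS itself: the same scores are normal when everyone in the simulated population is included.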

score 2 · Accepted Answer · 2024-06-20

2

5 months ago

Sam ★ 4.8k

You are correct. When the polygenic risk score is very predictive, it is possible to see a bimodal distribution. We also see this when the number of variants used to construct the PRS is small, or when there is population stratification (e.g. using a EUR GWAS for AFR samples, as there are problems with the LD structure, leading to non-independent signals).
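The small-number-of-variants case is easy to check with a toy simulation. The sketch below uses made-up allele frequencies and effect sizes and a hypothetical helper `simulate_prs`; it is not how any PRS software computes scores. With only 3 SNPs, each with genotype 0, 1 or 2, the score can take at most 3^3 = 27 distinct values, so its distribution is a set of discrete spikes, not a smooth bell curve.

```python
import numpy as np

def simulate_prs(n_people, n_snps, rng):
    """Return a toy PRS: a weighted sum of independent SNP allele counts."""
    # Hypothetical allele frequencies and effect sizes, drawn at random.
    freqs = rng.uniform(0.05, 0.5, size=n_snps)
    betas = rng.normal(0.0, 1.0, size=n_snps)
    genotypes = rng.binomial(2, freqs, size=(n_people, n_snps))
    return genotypes @ betas

rng = np.random.default_rng(2)
few = simulate_prs(n_people=2000, n_snps=3, rng=rng)
many = simulate_prs(n_people=2000, n_snps=1000, rng=rng)

# Round before counting so floating-point noise does not split equal scores.
n_few = len(np.unique(np.round(few, 8)))
n_many = len(np.unique(np.round(many, 8)))
print(f"distinct PRS values with 3 SNPs:    {n_few}")
print(f"distinct PRS values with 1000 SNPs: {n_many}")
```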