In GWAS, what is meant by 'single SNP association studies only explain a small part of disease heritability'? How is this explained heritability quantified?
In statistics, when people say "17% of the variability is explained by X Y and Z", they are referring to the proportion of the variance that can be accounted for by the predictors in the statistical model.
For example, suppose you ran a big association study on the genetics of lung cancer. You would need to include smoking as a covariate in the model. Why? Well, 1) this prevents you from misattributing cancer that is actually due to smoking to the person's genetics. But your question touches on a second reason: 2) when you include smoking as a covariate, you can "partial out", or more colloquially "explain away", some of the variance in the data. This means there is less unexplained variation left over in the dataset, or "less variance left to explain". Still confused? I'll keep yammering.
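To make "partialling out" concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the continuous `risk` score, the effect sizes (2.0 for smoking, 0.5 per risk allele), the frequencies, and the helper names are all hypothetical, not taken from any real lung cancer study. It removes the smoking effect first and then asks how much of the original variance the genotype accounts for on top of that.

```python
import random

def residuals(y, x):
    """Residuals of a least-squares fit of y on x (with an intercept)."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    return [b - mean_y - slope * (a - mean_x) for a, b in zip(x, y)]

def sum_sq(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

rng = random.Random(1)
n_people = 5000

# Hypothetical data: 30% smokers, one SNP with risk-allele frequency 0.2.
smoking = [1 if rng.random() < 0.3 else 0 for _ in range(n_people)]
genotype = [sum(rng.random() < 0.2 for _ in range(2)) for _ in range(n_people)]
risk = [2.0 * s + 0.5 * g + rng.gauss(0, 1) for s, g in zip(smoking, genotype)]

total_ss = sum_sq(risk)

# Step 1: partial out smoking from the outcome.
risk_adj = residuals(risk, smoking)
ss_after_smoking = sum_sq(risk_adj)

# Step 2: partial out smoking from the genotype too, then fit the adjusted
# outcome on the adjusted genotype (equivalent to fitting both predictors).
genotype_adj = residuals(genotype, smoking)
ss_after_both = sum_sq(residuals(risk_adj, genotype_adj))

frac_smoking = 1 - ss_after_smoking / total_ss
frac_genotype = (ss_after_smoking - ss_after_both) / total_ss

print(f"smoking explains   {frac_smoking:.1%} of the variance")
print(f"genotype adds      {frac_genotype:.1%} on top of smoking")
```

With these made-up effect sizes, smoking soaks up a large share of the variance and the SNP only a small one, which is the usual pattern for a single variant.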
Complicated phenomena like cancer are hard to predict because they have many distinct contributing factors. To predict cancer incidence perfectly, you would have to know all of the predictors. If you don't, your predictive model might succeed in explaining only 50% of the phenomenon (for example). A statistician would say about this that "the model accounts for 50% of the variance in the dataset." In the context of your question, the same idea is phrased as "only 15% of the heritability of the disease can be explained by known genetic risk factors". There, the denominator is the part of the variance attributed to genetics (the heritability) rather than all of the variance.
Numerically, what is going on has to do with ratios of sums of squares (SS). The denominator is the total SS of the data. The numerator is the part of that total SS that your model explains.
So if you had
(explained SS) = 10000
(unexplained SS) = 40000
the total SS would be 10000 + 40000 = 50000, the ratio of explained to total would be 10000/50000 = 0.20, and your model would "explain" 20% of the variance.
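The same arithmetic as a small sketch (the function name `fraction_explained` is just an illustrative choice). The point to notice is that the denominator is the total SS, meaning explained plus unexplained, not the unexplained SS on its own.

```python
def fraction_explained(explained_ss, unexplained_ss):
    """Share of the total sum of squares that the model accounts for."""
    total_ss = explained_ss + unexplained_ss
    if total_ss <= 0:
        raise ValueError("total sum of squares must be positive")
    return explained_ss / total_ss

ratio = fraction_explained(10000, 40000)
print(f"{ratio:.0%} of the variance explained")  # prints "20% of the variance explained"
```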
Strictly speaking, this does not apply only to complex polygenic conditions, though that is where it is most likely to come up.
As a counterexample, consider CFTR. If this gene is mutated in certain ways, the person may develop cystic fibrosis. Now, so far, we have found >1200 mutations that can lead to that phenotype or something like it... but imagine that we had only found 300 of these mutations to date.
Even though these variants are all in the same gene, you could still enter these 300 variants into a statistical model (e.g. a general linear model) as predictors, and you might be able to account for, say, 43% of the variance in the data. That figure relates directly to the ratio of sums of squares mentioned earlier.
Hope this helps.
This isn't really a bioinformatics question; it's more one of basic genetics.
Anyway, most traits/diseases/etc. aren't monogenic. The probability of having them depends on a relatively large number of changes, all interacting in often unknown ways. Thus, you may be able to associate a single one of those changes (a SNP, for instance) with the disease/whatever of interest, but that will only explain a small portion of how the disease/whatever is inherited, since there are often many other SNPs involved. BTW, the quantification comes directly from the statistics.
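A toy simulation shows the scale of the effect. This is only a sketch under loud assumptions: 50 independent SNPs with equal, purely additive effects, a risk-allele frequency of 0.3, and noise tuned so that about half of the trait variance is genetic. All names and numbers are hypothetical, and real traits are messier (linkage, interactions, unequal effects).

```python
import math
import random

def r_squared(x, y):
    """Squared Pearson correlation: the share of the variance in y that a
    straight-line fit on x accounts for."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

rng = random.Random(42)
N_PEOPLE, N_SNPS, ALLELE_FREQ = 2000, 50, 0.3

# Each genotype is the number of risk alleles (0, 1 or 2) at one SNP.
genotypes = [
    [sum(rng.random() < ALLELE_FREQ for _ in range(2)) for _ in range(N_SNPS)]
    for _ in range(N_PEOPLE)
]

# Trait = sum of all risk alleles + noise with the same variance as the
# genetic part, so roughly half of the trait variance is heritable.
genetic_sd = math.sqrt(N_SNPS * 2 * ALLELE_FREQ * (1 - ALLELE_FREQ))
trait = [sum(person) + rng.gauss(0, genetic_sd) for person in genotypes]

# Variance explained by each SNP on its own, and by all SNPs combined.
per_snp_r2 = [
    r_squared([person[j] for person in genotypes], trait) for j in range(N_SNPS)
]
all_snps_r2 = r_squared([sum(person) for person in genotypes], trait)

print(f"best single SNP explains {max(per_snp_r2):.1%} of the trait variance")
print(f"all {N_SNPS} SNPs together explain {all_snps_r2:.1%}")
```

Under these assumptions each SNP is expected to explain only about 1% of the trait variance, even though the SNPs together account for roughly half of it. If a study had detected only a handful of the 50, the explained share would be a small slice of the heritability.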
Thanks for the explanation. This is very clear.