Your questions are not a nuisance, so do not feel bad for asking.
In association studies, the usual focus at each SNP position is the minor allele, i.e., the allele with the lowest frequency among the samples in your dataset (I am assuming that you know this). At some genotyped sites, the minor allele may have a frequency (the minor allele frequency, MAF) of 49% compared to 51% for the major allele, which is less interesting because, at 49%, it is seen as a 'common' variant. At others, however, the minor allele may have a MAF of just 1%, which classes it as a 'very rare' variant (a MAF of 5% is usually the cut-off for rare / non-rare). It is important to note, however, that both common and rare variants can be functional and have roles in disease. For further reading, see: Rare and common variants: twenty arguments.
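As a quick illustration of how a MAF is obtained, here is a minimal sketch in R. The genotype counts and the allele labels A and G are made up for the example and do not come from any real dataset.

```r
# Hypothetical genotype counts at one biallelic SNP (alleles A and G)
genotype.counts <- c(AA = 60, AG = 35, GG = 5)

# Each individual carries 2 alleles
n.alleles <- 2 * sum(genotype.counts)

# Homozygotes contribute 2 copies of an allele, heterozygotes contribute 1
freq.A <- (2 * genotype.counts[["AA"]] + genotype.counts[["AG"]]) / n.alleles
freq.G <- (2 * genotype.counts[["GG"]] + genotype.counts[["AG"]]) / n.alleles

# The minor allele is whichever allele is less frequent in this sample
maf <- min(freq.A, freq.G)
maf
```

Note that the minor allele is defined relative to the samples at hand, so the same allele can be 'minor' in one cohort and 'major' in another.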
In any case, if we just take the most basic type of association test and tabulate the number of minor and major alleles in our cases and controls, we can get an example 2x2 contingency table like this:
contingency.table
             Cases Controls
Minor allele    27        6
Major allele    73       94
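For anyone who wants to reproduce the table, one way to build it in R is sketched below. The object name matches the one used in this post; the use of `prop.table()` to show the per-group allele frequencies is my own addition.

```r
# Allele counts: matrix() fills column by column (Cases first, then Controls)
contingency.table <- matrix(
  c(27, 73, 6, 94),
  nrow = 2,
  dimnames = list(
    c("Minor allele", "Major allele"),
    c("Cases", "Controls")
  )
)
contingency.table

# Allele frequencies within each group (each column sums to 1)
prop.table(contingency.table, margin = 2)
```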
You can see that the minor allele is more frequent in the cases for this particular SNP. We can easily derive a 1 degree of freedom Chi-square p-value for this in R:
chisq.test(contingency.table)
Pearson's Chi-squared test with Yates' continuity correction
data: contingency.table
X-squared = 14.516, df = 1, p-value = 0.0001389
This is nowhere near genome-wide significance (conventionally p < 5e-8), but it is only a 100-sample dataset used as an example.
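To see where the X-squared value comes from, the statistic can be rebuilt by hand from observed and expected counts. This is only a sketch of the standard Yates-corrected formula for a 2x2 table; the variable names are mine.

```r
# Same allele counts as the contingency table above
tab <- matrix(c(27, 73, 6, 94), nrow = 2)

# Expected counts under no association: (row total * column total) / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Yates' continuity correction subtracts 0.5 from each |observed - expected|
x2.yates <- sum((abs(tab - expected) - 0.5)^2 / expected)

# Upper-tail p-value from the Chi-square distribution with 1 degree of freedom
p.value <- pchisq(x2.yates, df = 1, lower.tail = FALSE)

c(X.squared = x2.yates, p.value = p.value)
```

Passing `correct = FALSE` to `chisq.test()` switches the continuity correction off, which gives a somewhat larger statistic.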
We can then derive an odds ratio (OR) for the minor allele:
(27/6) / (73/94)
[1] 5.794521
Standard error of the log OR (the standard error is calculated on the log scale):
sqrt((1/27) + (1/6) + (1/73) + (1/94))
[1] 0.477536
Upper 95% confidence interval (CI) limit of the OR:
5.794521 * exp(1.96 * 0.477536)
[1] 14.77421
Lower 95% CI of the OR:
5.794521 * exp(- 1.96 * 0.477536)
[1] 2.27264
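The OR, standard error, and CI steps above can also be run in one go from the table itself. This is a minimal sketch; the variable names are my own, and 1.96 is hard-coded to match the calculations in this post.

```r
# Rows: minor / major allele; columns: cases / controls
tab <- matrix(c(27, 73, 6, 94), nrow = 2)

# Odds ratio for the minor allele
odds.ratio <- (tab[1, 1] / tab[1, 2]) / (tab[2, 1] / tab[2, 2])

# Standard error of log(OR): square root of the summed reciprocal cell counts
se.log.or <- sqrt(sum(1 / tab))

# The 95% CI is built on the log scale, then exponentiated back
ci.95 <- exp(log(odds.ratio) + c(-1, 1) * 1.96 * se.log.or)

c(OR = odds.ratio, lower = ci.95[1], upper = ci.95[2])
```

Because the interval is symmetric on the log scale only, the limits are not equidistant from the OR itself.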
With all of this information, we can also calculate the Z-score, which is the log of the OR (log.OR) divided by the standard error of log.OR (SE.log.OR). Here, SE.log.OR is recovered from the OR and the lower 95% CI limit of the OR, which is handy when only those values are reported; it equals the standard error calculated directly above (0.477536):
log.OR <- log(5.794521)
lower95.log.OR <- log(2.27264)
SE.log.OR <- (log.OR - lower95.log.OR) / 1.96
Then calculate Z:
log.OR / SE.log.OR
[1] 3.679121
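If a p-value is wanted from the Z-score, it can be looked up against the standard normal distribution. A small sketch, assuming a two-sided test; the variable names are mine.

```r
# Z-score from the log OR and its standard error (values from above)
z <- log(5.794521) / 0.477536

# Two-sided p-value from the standard normal distribution
p.two.sided <- 2 * pnorm(-abs(z))

c(Z = z, p = p.two.sided)
```

This is a Wald-type test, so do not expect its p-value to match the Yates-corrected Chi-square p-value exactly; they are different tests of the same association.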
----------------------------------------------------------------
Another way to calculate p-values, ORs, and Z-scores in association studies is through logistic regression analysis. In regression, one can encode the genotypes as categorical variables or, more usually, as numerical variables in 'additive' models. In the additive case, each genotype is coded by the number of minor alleles that it carries:
- homozygous minor allele = 2
- heterozygous = 1
- homozygous major allele = 0
One can also adjust for covariates in these models, such as smoking status, BMI, ethnicity, and/or PCA eigenvectors. From regression, the OR is the exponential of the coefficient estimate for the genotype term, and the Z-score (if not explicitly given) can be calculated in the same way as above, i.e., the estimate divided by its standard error. I built a pipeline for a complex type of trios family analysis using these types of metrics and conditional logistic regression (where cases and controls are matched into strata): GwasTriosCLogit
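To make the regression approach concrete, here is a minimal sketch of an additive logistic model in R. All of the data are simulated for illustration: the sample size, the allele frequency, the `age` covariate, and the effect sizes are assumptions of mine, not values from any real study, and this is not the conditional model used in GwasTriosCLogit.

```r
set.seed(1)
n <- 500

# Additive coding: number of minor alleles (0, 1, or 2), simulated with MAF 0.3
genotype <- rbinom(n, size = 2, prob = 0.3)

# A hypothetical covariate to adjust for
age <- rnorm(n, mean = 50, sd = 10)

# Simulated case (1) / control (0) status with an assumed per-allele log OR of 0.5
linear.predictor <- -0.5 + 0.5 * genotype + 0.01 * (age - 50)
status <- rbinom(n, size = 1, prob = plogis(linear.predictor))

# Logistic regression of status on genotype, adjusted for age
fit <- glm(status ~ genotype + age, family = binomial)
coefs <- summary(fit)$coefficients

# Per-allele OR is the exponential of the genotype estimate
estimate <- coefs["genotype", "Estimate"]
std.error <- coefs["genotype", "Std. Error"]
odds.ratio <- exp(estimate)

# Z-score as reported by glm, which is the estimate divided by its standard error
z <- coefs["genotype", "z value"]

# 95% CI of the OR, built on the log scale
ci.95 <- exp(estimate + c(-1, 1) * 1.96 * std.error)

c(OR = odds.ratio, lower = ci.95[1], upper = ci.95[2], Z = z)
```

To treat the genotypes as categorical instead, `factor(genotype)` can replace `genotype` in the formula, which gives a separate estimate for heterozygotes and minor-allele homozygotes.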
---------------------------------------------
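If you are wondering where I magically got the 1.96 that I used in the calculations: it is the 97.5th percentile of the standard normal distribution, which bounds a two-sided 95% interval. For more, look HERE.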
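The 1.96 can be checked directly in R. The helper function `z.for.ci` below is a hypothetical convenience of mine, not part of any package.

```r
# Critical value that leaves 2.5% in each tail of the standard normal
qnorm(0.975)

# Hypothetical helper: critical value for any two-sided confidence level
z.for.ci <- function(level = 0.95) {
  qnorm(1 - (1 - level) / 2)
}

z.for.ci(0.95)
z.for.ci(0.99)
```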
This example is just to give you a fundamental understanding of what is going on 'behind the scenes' in association studies. Obviously, there are many dozens of types of analyses that involve different statistical tests, and programs like PLINK are undoubtedly making more adjustments to the data than I have shown here.
Kevin
We used to use <0.01% (not more than 1 in 10,000 alleles) as a rare variant cutoff. IMO 1% is common, as you're seeing the variant in 1 in 100 alleles, which could be as low as 1 in 50 people.