I'm running a GWAS with ~15 million variants and ~800 people. I am unfamiliar with Linux, so I have tried using PLINK's MDS and PCA functions to obtain principal components (PCs) to use as covariates in the association analysis, in order to control for population stratification. When I made a QQ plot of the p-values from the association analysis, the distribution was pretty messy, suggesting that I did not adequately control for population stratification. I took the following steps:
- Pruned based on LD using PLINK --indep
- Created a genome file:
./plink --bfile file --genome --extract plink.prune.in
- Used --pca to generate an eigenvec file containing PCs:
./plink --bfile gendep_merged --cluster --pca header --extract plink.prune.in --read-genome plink.genome
- Performed the association analysis using 10 PCs from the eigenvec file as covariates:
./plink --bfile file --pheno phenotype.txt --allow-no-sex --covar plink.eigenvec --covar-name PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10 --out association --linear --adjust
Am I missing a step, or should any of the flags used be modified in order to produce PCs that will adequately control for population stratification in this sample?
Any input would be greatly appreciated.
Using 10 PCs does indeed seem a bit excessive. You should only include the PCs that actually stratify your population. If none of them do, then do not include any.
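One rough way to judge which PCs carry any structure is to look at the eigenvalues that --pca writes next to the eigenvec file. The sketch below assumes PLINK 1.9's default output name, plink.eigenval, with one eigenvalue per line. pc_variance_share is a made-up helper name, and the share it prints is relative to the reported PCs only, not to the total genetic variance, so treat it as a quick scree-style check rather than a formal test.

```shell
# Hypothetical helper (not part of PLINK): print each PC's share of the
# variance captured by the PCs listed in an eigenvalue file.
pc_variance_share() {
    # $1: path to an eigenvalue file, one eigenvalue per line
    awk '
        NF { ev[++n] = $1; total += $1 }   # skip blank lines, sum eigenvalues
        END {
            if (n == 0 || total == 0) exit 1   # nothing usable to report
            for (i = 1; i <= n; i++)
                printf "PC%d\t%.3f\n", i, ev[i] / total
        }
    ' "$1"
}
```

After the --pca run you would call it as: pc_variance_share plink.eigenval. If the shares drop off sharply after the first few PCs, the later ones are probably not capturing much structure. Plotting PC1 against PC2 (and so on) from the eigenvec file is still the more direct way to see whether samples actually separate.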