I'm running a GWAS for the first time and modeling my analysis based on this paper, since the nature of our data is similar. The analysis uses GEMMA to apply a linear mixed model and includes the top five principle components as covariates. However, when I ran the analysis and generated a QQ plot, it showed substantial evidence of population structure (early divergence from the expected line).
To correct the population structure issues, I tried using the top 20 principle components instead of the top 5. I also tried pruning for linkage disequilibrium prior to generating the principle components. Neither of these significantly altered the QQ plot.
The only other thing I can think of is to try correcting my phenotypes for additional covariates. I'm using a fly model and so I could correct for Wolbachia infection status and the presence of chromosomal inversions (thought none of these are individually associated with my phenotype). However, since my phenotype is binary (case/control), the residuals from fitting a logistic model look strange. The phenotypes start out as 1's and 0's but end up looking more like a continuous variable that includes some negative numbers and ranges from -2 to 5. This would cause GEMMA to treat it like a quantitative phenotype instead of binary and potentially disrupt the analysis.
Does anyone have advice for how I might proceed?
Edit: One option that just occurred to me is to just include the additional covariates (Wolbachia/inversions) in the covariates file along with the eigenvectors generated by PCA. However the fact that all the other papers I've seen with this model have used corrected phenotypes (i.e., covariates regressed out) rather than including them in the covariates file makes me think that perhaps there's some kind of problem with this approach?
Hi Kevin, thanks so much for this helpful information. I'm still a bit confused about how I should decide how many eigenvectors to include based on the bi-plot. I don't have any categorical data (such as ethnicity, as in your example) that I can use to color-code my plots. I used R to plot the first component vs. components 2, 3, 4, and 5, but I'm not sure what I'm looking for in these plots to decide whether or not the component should be included.
The standard is to include just 2 eigenvectors, i.e., to correct or population structure. If your sample cohort is entirely from the same region, however, then you may not even have to correct for population structure.
Ah I see. That's strange because for all other papers I've seen using this model set (205 inbred Drosophila lines derived from the same geographical area) always include 5-20 principle components as covariates. In my case, I fitted a logistic regression to the first 5 pc's and determined that none were significantly associated with my phenotype. However, in the paper that I'm basing my analysis on (linked in my original post) they also determined that none of the pc's were associated with their phenotype, yet they still included them as covariates. Perhaps I should try running the analysis without any pc covariates at all?
Just dawned on me that you are studying Drosophila spp., here the population differences are undoubtedly different depending on how they are bred.
I don't know the specifics of your own study cohort but, if you found that none of the first 5 PCs were statistically significant, then none of the subsequent PCs are likely to be. You can try without PC covariate adjustment to see how it affects the QQ plot.
Since none of the pc's or other covariates were statistically significant, I ran the analysis again without any covariates. The QQ plot from that data had a strange step-shaped pattern: https://imgur.com/a/t9Zu7
Out of interest, can you show the other QQ plots? Obviously, a QQ plot does not have to perfectly follow the expected distribution (red line) - the deviations from the expected distribution can be reflective of the differences for which you are looking, i.e., between cases and controls