I am currently working on running a GWAS using GEMMA to apply a linear mixed model, and I've been experiencing some difficulties with my analysis. Specifically the QQ plot has shown some early deviation from the expected line, as I described in a previous question. I also recently noticed that the log file from GEMMA reports the PVE under the null model is roughly 1e-5, with a standard error estimate of 0.2. I am confused how to interpret this. Does this mean that my phenotype is almost entirely environmental rather than genetic? Or does it mean the opposite, because that's the estimate under the "null model"?
The extremely low pve estimate (and comparatively large standard error) makes me think that there may be something wrong in my analysis setup. An additional note: If I omit the covariates file (containing the top 5 principle components), the QQ plot ends up having a strange wavy-looking shape, further suggesting that something is amiss here. I've included some extra details about my analysis below. If anyone has any suggestions on what I might be doing wrong, I'd greatly appreciated it.
This is the command I used:
gemma -bfile myfile_clean -k myfile_cXX.txt -lmm 2 -c covariates.txt -o myfile
The p-value estimates are based on a likelihood ratio test. I used GEMMA to estimate the centered genetic relationship matrix after LD pruning, and also included the eigenvectors for the top 5 principle components (generated by GCTA software) as covariates. None of these principle components were significantly associated with my phenotype (based on a logistic regression) but I included them as covariates anyway, and as I said omitting them causes a weird wavy-looking QQ plot. I'm using a Drosophila model (150 inbred lines) and determined that Wolbachia infection status and major chromosomal inversions are not associated with my phenotype, so I didn't include them as covariates. When I ran it again with them included as covariates in addition to the principle components, the QQ plot looked basically the same, as well as the pve estimate.
Also, my phenotype is binary (case/control) and I've dealing with an unbalanced dataset (30 cases, 120 controls). When I tried balancing it by only including 30 randomly-selected controls, the resulting QQ plot showed the data being below the expected line, rather than above, and the pve was still very low. I'm at a loss for what could be causing these problems.
How is the kinship matrix: Do you see values between 0.3-0.4 diagonally?
Also, if you're using kinship you might not want to use PCs as covariate.
Yes, they seem to all be around 0.25-0.35. I tried it without the PCs included as covariates, but as the original post states, this caused the resultant QQ plot to have a strange wavy-looking shape, and I'm not sure why.
Hi Bioinformatics_NewComer, thanks for contributing - the more eyes, the better!
Given the low case numbers, I think that we will always struggle here! In light of this, the low PVE with high standard error makes sense; although, I say this whilst not knowing exactly how GEMMA calculates PVE (but whilst having used other measures of variation contribution on other data).
Did you ever try just including 1-3 PCs?; did you ever try just the Wald test (
lmm 1
) instead of likelihood ratio test (lmm 2
)?Here's what the QQ plot looks like for the Wald test when including 5 PCs: https://imgur.com/a/KjMV6 It's still not great but a bit better, though the p-values are quite a bit larger. I'll give it a try with fewer PCs and let you know the results...
Update: The results from using 2 PCs look basically the same. Unfortunately I think I'm probably just going to have to note the likely existence of bias in my data due to the experimental design. For the purposes of my undergraduate thesis at least, this will not be a major problem. Thanks to both of you for the advice.
hmm, like kevin mentioned - give try a with lmm 1 (wald test)
Second, how is the quality of SNPs you're including? Run the analysis by excluding rare SNPs.
See my responses to Kevin above. I pre-filtered my genotype data by maf > 1% and missingness <15%.
Ah OK.
Check the lambda too. See if you can use Metal to reduce the inflation using genomic control flag.
See if you can run the analysis with missingness <=5%