I have two cohorts of differing sizes (one of ~2300 people and one of ~130) and have fitted logistic regression models, one per variant, to test whether the presence/absence of a genetic mutation differs between the two groups (adjusted for some nuisance factors):
Variant ~ Cohort + other
where Variant is a binary variable (whether the variant is present 1 or absent 0) and Cohort is a binary variable (0 for reference, 1 for alternative), using the R code
formula <- as.formula(paste(var[i], "~ cohort + other"))
output <- glm(formula, data = combined.table, family = binomial(link = "logit"))
# Row 2 is the cohort term; column 4 is its Wald p-value, Pr(>|z|)
p.val[i] <- summary(output)$coefficients[2, 4]
(This is a snippet from the code: var[i] is the variant variable (1 or 0), cohort is the cohort variable (1 or 0), and other is the other factor.)
I plotted a QQ plot of the resulting ordered p-values against the null hypothesis that the p-values are uniformly distributed, using the gaston package in R (https://rdrr.io/cran/gaston/man/qqplot.pvalues.html). (The method is similar to that in figure 2A of this paper: https://pubmed.ncbi.nlm.nih.gov/32396860/)
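To put a number on how far the p-values depart from the uniform null, the genomic inflation factor (lambda) is often reported alongside the QQ plot. The following is a minimal base-R sketch, not the gaston implementation: inflation_factor is a hypothetical helper name, and the simulated p.null vector stands in for the real p.val.

```r
# Hypothetical helper: genomic inflation factor from a vector of p-values.
# Each p-value is converted to a 1-df chi-square statistic; lambda is the
# ratio of the observed median to the median expected under the null.
inflation_factor <- function(p) {
  chisq <- qchisq(p, df = 1, lower.tail = FALSE)
  median(chisq, na.rm = TRUE) / qchisq(0.5, df = 1)
}

# Simulated uniform p-values stand in for the real p.val (assumption).
set.seed(1)
p.null <- runif(5000)
lambda.null <- inflation_factor(p.null)        # expected to be close to 1

# Squaring uniform p-values makes them too small, mimicking inflation.
lambda.inflated <- inflation_factor(p.null^2)  # expected to be well above 1

# Observed vs expected -log10(p), the same quantities a QQ plot displays.
obs  <- -log10(sort(p.null))
expd <- -log10(ppoints(length(obs)))
```

A lambda near 1 is consistent with uniform p-values, while values well above 1 correspond to the kind of upward departure described here.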
However, unlike the figure 2A in the paper, my qqplot appears skewed with the observed p-values being too small:
Is this a problem with the model I'm trying to fit or how I'm trying to fit it?
I don't know exactly what your goal is, but this Q-Q plot looks pretty inflated to me. You could check whether something like population stratification in your data is causing this. Just a thought!
How would I check for population stratification? There was some minimal filtering of the inputs before getting to this stage, but even without this filtering, the graphs look similar.
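One common way to look for population stratification is to run a principal component analysis on the variant matrix, see whether the top components separate the two cohorts, and then refit the models with those components as extra covariates. The sketch below is only an illustration under stated assumptions: geno is a hypothetical individuals-by-variants presence/absence matrix, the data are simulated, and the choice of five components is arbitrary.

```r
set.seed(42)
n <- 300   # individuals (simulated, not the real cohort sizes)
m <- 200   # variants
cohort <- rbinom(n, 1, 0.2)

# Hypothetical presence/absence matrix. Variant frequencies are shifted
# slightly in cohort 1 to mimic cohort-level structure.
freq <- runif(m, 0.1, 0.5)
geno <- sapply(freq, function(f) rbinom(n, 1, pmin(f + 0.1 * cohort, 0.9)))

# Drop monomorphic variants, which cannot be scaled to unit variance.
geno <- geno[, apply(geno, 2, var) > 0, drop = FALSE]

# Top principal components of the centred and scaled variant matrix.
pcs <- prcomp(geno, center = TRUE, scale. = TRUE)$x[, 1:5]

# Do any of the top PCs differ between cohorts? Small p-values suggest
# that the cohorts differ in overall genetic background.
pc.cohort.p <- apply(pcs, 2, function(pc) wilcox.test(pc ~ cohort)$p.value)

# Refit one variant with the PCs added as covariates.
dat <- data.frame(variant = geno[, 1], cohort = cohort, pcs)
fit <- glm(variant ~ cohort + PC1 + PC2 + PC3 + PC4 + PC5,
           data = dat, family = binomial(link = "logit"))
p.adj <- summary(fit)$coefficients["cohort", "Pr(>|z|)"]
```

If the QQ plot of the PC-adjusted p-values moves back towards the diagonal, stratification is a plausible explanation; if it does not, the inflation probably has another source.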