I'm attempting to call SNPs that changed in allele frequency in a population. The population was sequenced at two times (before and after a drought) using pooled-sequencing of the whole genome. Fisher exact tests identified 7 million SNPs, which was comparable to a past study in this species which found 6 million SNPs across the whole genome. When I generate a Manhattan plot using the P values from fishers exact test and a Bonferroni corrected P value for 7M SNPs, about 4k SNPs are significant. This amount seems reasonable considering the population adapted to strong drought in between samplings and that I had several-hundred times coverage of the genome, which was greater than the previous study that found 6M SNPs. However, when I make a qqplot of these P values versus expected P values, most of the observed P values are well above the expected 1:1 line (image below). I calculated the genomic inflation factor lambda to be 2.06 by dividing the median observed chi-squared statistic by the expected median. One of the ways I've came across to correct for this inflation is to divide all observed test statistics by lambda and use the resulting "genomic corrected" P values. However, using this method seems overly conservative and results in only 23 SNPs reaching statistical significance.
I'm interested in any thoughts or suggestions for dealing with the issue of P value inflation. On the one hand, I could probably just filter the 4k P values for those in exons and have a reasonable number to proceed. On the other hand, the qqplot indicates something else is going on that I should probably control for, like population structure. I'm aware of a few methods that can control for population structure in GWAS, but my data is pooled by population and so my understanding is those methods wouldn't apply with out individual-level data. Is there way I can easily correct for P value inflation? Or, do I need to analyze the data with a different approach/toolkit that can take population structure into account?
maybe one and two sided alternatives are the reason?
The fisher test I ran was two sided. The theoretical p values were constructed based on the number of observations for the fisher tests. I'm not sure I see how that could make a difference, can you clarify if you still think this could be the issue?
Your statistic is exactly 2x linearly inflated which makes it very suspicious =)
https://en.wikipedia.org/wiki/Fisher%27s_exact_test
It is difficult to me to judge what was exactly wrong since I do not see the formulas you used, but a linear inflation by 2x factor is certainly some multiplier that was omitted.
Thanks for your suggestion. I ran a fishers exact test with Popoolation2 and then calculated the expected p values for a uniform distribution of the number of SNPs that returned. I've looked back at my code and can't find anything to suggest a multiplier was left out, so I don't think this is the problem so I'll probably have to either thin SNPs (maybe inflated p values was caused by strong selective sweeps) or use a program that can account for population structure.
You have a perfect straight line, perfect linear dependency between two large vectors of values. Neither population structure not inflated p-values cause such perfect correlation, they usually affect only "tails" of distributions.