Dealing with P value inflation of SNPs identified by qqplot
0
0
Entering edit mode
3.2 years ago

I'm attempting to call SNPs that changed in allele frequency in a population. The population was sequenced at two times (before and after a drought) using pooled-sequencing of the whole genome. Fisher exact tests identified 7 million SNPs, which was comparable to a past study in this species which found 6 million SNPs across the whole genome. When I generate a Manhattan plot using the P values from fishers exact test and a Bonferroni corrected P value for 7M SNPs, about 4k SNPs are significant. This amount seems reasonable considering the population adapted to strong drought in between samplings and that I had several-hundred times coverage of the genome, which was greater than the previous study that found 6M SNPs. However, when I make a qqplot of these P values versus expected P values, most of the observed P values are well above the expected 1:1 line (image below). I calculated the genomic inflation factor lambda to be 2.06 by dividing the median observed chi-squared statistic by the expected median. One of the ways I've came across to correct for this inflation is to divide all observed test statistics by lambda and use the resulting "genomic corrected" P values. However, using this method seems overly conservative and results in only 23 SNPs reaching statistical significance.

qqplot image

I'm interested in any thoughts or suggestions for dealing with the issue of P value inflation. On the one hand, I could probably just filter the 4k P values for those in exons and have a reasonable number to proceed. On the other hand, the qqplot indicates something else is going on that I should probably control for, like population structure. I'm aware of a few methods that can control for population structure in GWAS, but my data is pooled by population and so my understanding is those methods wouldn't apply with out individual-level data. Is there way I can easily correct for P value inflation? Or, do I need to analyze the data with a different approach/toolkit that can take population structure into account?

pooled SNP inflation calling factor sequencing genomic • 3.6k views
ADD COMMENT
0
Entering edit mode

maybe one and two sided alternatives are the reason?

ADD REPLY
0
Entering edit mode

The fisher test I ran was two sided. The theoretical p values were constructed based on the number of observations for the fisher tests. I'm not sure I see how that could make a difference, can you clarify if you still think this could be the issue?

ADD REPLY
0
Entering edit mode

Your statistic is exactly 2x linearly inflated which makes it very suspicious =)

In order to calculate the significance of the observed data, i.e. the total probability of observing data as extreme or more extreme if the null hypothesis is true, we have to calculate the values of p for both these tables, and add them together. This gives a one-tailed test, with p approximately 0.001346076 + 0.000033652 = 0.001379728. For example, in the R statistical computing environment, this value can be obtained as fisher.test(rbind(c(1,9),c(11,3)), alternative="less")$p.value, or in python, using scipy.stats.fisher_exact(table=[[1,9],[11,3]], alternative="less") (where one receives both the prior odds ratio and the p-value). This value can be interpreted as the sum of evidence provided by the observed data—or any more extreme table—for the null hypothesis (that there is no difference in the proportions of studiers between men and women). The smaller the value of p, the greater the evidence for rejecting the null hypothesis; so here the evidence is strong that men and women are not equally likely to be studiers.

https://en.wikipedia.org/wiki/Fisher%27s_exact_test

It is difficult to me to judge what was exactly wrong since I do not see the formulas you used, but a linear inflation by 2x factor is certainly some multiplier that was omitted.

ADD REPLY
0
Entering edit mode

Thanks for your suggestion. I ran a fishers exact test with Popoolation2 and then calculated the expected p values for a uniform distribution of the number of SNPs that returned. I've looked back at my code and can't find anything to suggest a multiplier was left out, so I don't think this is the problem so I'll probably have to either thin SNPs (maybe inflated p values was caused by strong selective sweeps) or use a program that can account for population structure.

ADD REPLY
0
Entering edit mode

You have a perfect straight line, perfect linear dependency between two large vectors of values. Neither population structure not inflated p-values cause such perfect correlation, they usually affect only "tails" of distributions.

ADD REPLY

Login before adding your answer.

Traffic: 2209 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6