Given genotype information (SNP allele frequencies) from different pooled populations, how can I calculate the probability (or significance, or some confidence value) of the populations having different allele frequencies? I'm not terribly interested in estimating the actual allele frequencies (that's easy to do anyway), just the probability of them differing.
E.g. if I have alleles A and C, with observed counts of 9/3 and 6/1 in the two populations, I can estimate the minor allele frequencies as 0.25 (3/12) and 0.14 (1/7), respectively, but is the difference significant, and what is the p-value?
Or more generally: given two sets of samples from binomial distributions with success probabilities p1 and p2, how can I test whether p1=p2?
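For counts this small, one standard way to compare two binomial proportions is Fisher's exact test on the 2x2 table of allele counts. Below is a minimal sketch in plain Python. The function name `fisher_exact_two_sided` is made up for illustration, and it assumes each observed allele is an independent draw, which ignores the extra sampling noise that pooling and sequencing add.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are populations, columns are allele counts (e.g. A and C).
    Hypothetical helper; assumes independent allele draws.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the
    # observed one (small tolerance guards against float ties).
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)

# Counts from the question: A/C = 9/3 in one pool, 6/1 in the other.
p_value = fisher_exact_two_sided(9, 3, 6, 1)
print(p_value)
```

For the 9/3 versus 6/1 example the p-value comes out at 1.0, because the observed table is the single most probable one given its margins, so these counts give no evidence of a difference.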
It seems this would be a rather fundamental task, but I haven't found any good source on how to do this, and my statistics-fu seems to have rusted...
Yes, I have #3, that is, moderate to deep sequencing of pools of individuals from separate populations. I've used bcftools, but it seems to be geared towards having two genotypes, and the documentation is a bit impenetrable. I can try -T pair, but I think it's better if I just write my own solution for this - that way it will be comprehensible to me.
Akamai seems to be struggling, so Boston College and github are slow as molasses right now, but I'll check out FreeBayes when they get back in order. From a cursory glance, it looks like just another SNP caller, with a zillion knobs and dials that need manual adjustment - I'm not sure it will be any help.
When I have a few good variant candidates that are likely to be different, I can go back and do proper individual based genotyping, as in your #1.
Yeah, the freebayes option I remembered is:
"To run on pooled data, set the --pooled flag (which turns off the prior component derived from the probability of a specific distribution of heterozygotes and homozygotes given the allele frequency), and set --ploidy to the number of total copies of the genome in each pooled sample."
But like I said, I've never used it, so I can't vouch for it firsthand. I'd be curious to know if you try it out and it works.
With regard to option 3, could you explain how you would construct the likelihood ratio test (or provide a source with more information)?
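For reference, a generic likelihood ratio test for two binomial samples can be sketched as below. This is not necessarily the construction meant by option 3: it compares a shared frequency (null) against separate frequencies (alternative) and uses the chi-squared approximation with one degree of freedom, which is unreliable for very small counts. The name `binomial_lrt` is hypothetical, and read counts from pools are treated as independent allele draws.

```python
from math import erfc, log, sqrt

def binomial_lrt(k1, n1, k2, n2):
    """Likelihood ratio test of H0: p1 == p2 against H1: p1 != p2.

    k = count of one allele, n = total allele count, per population.
    Returns (test statistic, approximate p-value). Hypothetical helper.
    """
    def loglik(k, n, p):
        # Binomial log-likelihood without the constant coefficient;
        # terms with a zero count are skipped (0 * log 0 taken as 0).
        ll = 0.0
        if k > 0:
            ll += k * log(p)
        if n - k > 0:
            ll += (n - k) * log(1 - p)
        return ll

    p_shared = (k1 + k2) / (n1 + n2)  # MLE under the null
    ll_null = loglik(k1, n1, p_shared) + loglik(k2, n2, p_shared)
    ll_alt = loglik(k1, n1, k1 / n1) + loglik(k2, n2, k2 / n2)

    stat = max(0.0, 2 * (ll_alt - ll_null))
    # Survival function of chi-squared with 1 degree of freedom.
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value

# Minor allele counts from the question: 3 of 12 and 1 of 7.
stat, p_value = binomial_lrt(3, 12, 1, 7)
print(stat, p_value)
```

With deeper coverage the approximation improves, but the p-values will still be too optimistic unless the model also accounts for the number of individuals in each pool.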