Question

How To Simulate The Null Distribution For Tajima'S D?

3

Entering edit mode

13.2 years ago

David W 4.9k

I know there are a few populations genetics/molecular evolution types in the Biostar community, so hopefully someone has already run into this problem.

I've calculated Tajima's D statistic for a couple of DNA alignments using the R package PEGAS, which gives a p-value using a Beta distribution. It happens that the Beta distribution doesn't perfectly model the distribution of D under neutral evolution, so many authors simulate data under neutrality to make a null distribution. Unfortunately, all the methods for these papers tend to say about this procedure is "we used ms to simulate the null...".

Does anyone know how to do this properly. My understanding was you simulated data with a matching number of segregating sites, so it you had 100 samples and 20 segregating sites you'd do

ms 100 1000 -s 20

giving 1000 simulations. However when I calculate D from those simulations I get a skewed distribution with a mean near to 1 (I would have though it would b 0) but a p-value pretty close to the one using the Beta distribution. It's entirely possible I messed up somewhere in the calculations, (you can see my function here) but I thought I'd make sure I'm actually simulating the distribution correctly before I pick that apart.

population sequence • 6.8k views

ADD COMMENT • link 13.2 years ago by David W 4.9k

0

Entering edit mode

Yeah! Love population genetics questions!

ADD REPLY • link 13.2 years ago by Jarretinha 3.4k

score 4 · Answer 1 · 2011-09-17

Aha! Found the problem! After checking you code carefully for mistakes (found none) I've noted that your interpretation (and expectation) of the test could be you problem. Tajima's D isn't normally distributed and is naturally skewed. By using a fixed number of segregating sites, 4Nu/2Nu will vary between samples to enforce the restriction. So, you can obtain a positively skewed distribution even in a neutral setup. I'm not aware of the overall sensitivity of Tajima's D in this situation. You can try to generate a much larger number of samples and verify if the skeweness still persists. If PEGAS is correct, it should decrease.

score 3 · Answer 2 · 2011-09-17

You are on the right track and Jarretinha provides additional (correct) insight into the problem. If you need to check whether your solution in R is giving good results, you can run a test alignment through your script and also through DNAsp, which provides methods to compute the confidence interval on Tajima's D via standard neutral coalescent simulations.

score 3 · Answer 3 · 2011-09-18

(not sure if this should be an answer, a comment or an edit to the original question - let me know if I'm doing it wrong...)

So, it turns out there was something funny going on in the way I was calculating D_ from my simulated data (I suspect the polymorphism counting step, but having dissected that yet). Thankfully ms comes with another utility sample_stats that calculates some of the standard diviersity indices including _D for each replicate so

ms 20 100 -s 20 | sample_stat > sims.tsv

Creates a text file with D_ in the sixth column. To call this in R and get the distribution of _D I've used

cmd <- paste("ms", n, nreps, "-s", S, "| sample_stats")
sims <- system(cmd, intern=TRUE)
get.D <- function(x) unlist(strsplit(x, "\t"))[6]
Ds <- as.numeric(unlist(lapply(sims, get.D)))

Which gives a result close to that of Arlequin (haven't checked DNAsp).

Thanks Casey and Jarretinha - both helpful answers, Jarrenthinha get's the Big Green Tick since getting to the bottom of what I was actually expecting from the simulations is what helped me get this answer.