Question

Selection from population data

10.3 years ago

Adrian Pelin ★ 2.6k

Hello,

I am interested in traces of positive selection in my population data. I have been able to calculate nucleotide diversity (Pi) and Watterson's Theta for synonymous and non-synonymous sites for every gene in my genome.

The problem is that I am a bit lost as to how to look for positive/negative selection. I do not really understand what these values, Pi and Theta, are. I have seen literature where Pi(A) is divided by Pi(S), which is sort of like dN/dS. If the ratio is bigger than 1, can we infer positive selection?

Thanks for any help,

Adrian

popgen watterson pi theta selection • 6.9k views

ADD COMMENT • link updated 3.0 years ago by Ram 44k • written 10.3 years ago by Adrian Pelin ★ 2.6k

Ram · Answer 1 · 2014-08-20

10.3 years ago

Zev.Kronenberg 12k

Population genetics can be difficult to break into, but worth it! I found a recent review that provides a decent overview of the current methods. It might not directly answer your question, but it is a good place to start.

"Detecting Natural Selection in Genomic Data"

ADD COMMENT • link updated 3.0 years ago by Ram 44k • written 10.3 years ago by Zev.Kronenberg 12k

This is very interesting, thanks for pointing me in the right direction.

I understand that measuring Fst values is a powerful way of identifying genes under diversifying selection. After computing Fst values, is there any way to determine which genes are significantly impacted? Is it possible to use a statistical test to determine which genes are evolving significantly faster than others? I have 9 population samples, so Fst is computed pairwise between any 2 populations.

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.3 years ago by Adrian Pelin ★ 2.6k

I may be a bit late to be of help here, but the way I have done this in the past is to calculate Fst on a SNP-by-SNP basis between each possible population pair, and then see which SNPs fall into the tail of the distribution of Fst scores (say the highest 1%, where the probability of positive selection is high). From there you can figure out which genes these SNPs fall into relatively easily using an R tool called NCBI2R. If you want to look for functional trends, you can then run the gene list you get from NCBI2R through a GO term overrepresentation test like GOrilla (web-based and free; it also uses FDR instead of the overly conservative Bonferroni correction).
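
For illustration only, here is a minimal Python sketch of the per-SNP Fst outlier idea described above. It is not the code used in the original analysis. It assumes biallelic SNPs, equally weighted populations, and a simple heterozygosity-based Fst, (H_T - H_S) / H_T; the function names and the input layout (SNP ID mapped to per-population allele frequencies) are hypothetical.

```python
from itertools import combinations


def pairwise_fst(p1, p2):
    """Heterozygosity-based Fst for one biallelic SNP.

    p1, p2: frequency of the same allele in two populations.
    Assumption: both populations are weighted equally.
    """
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    if h_total == 0.0:
        # Site is monomorphic across both populations: no differentiation.
        return 0.0
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_total - h_within) / h_total


def top_tail(scores, fraction=0.01):
    """Return the SNP IDs with the highest scores (at least one SNP)."""
    n_keep = max(1, int(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_keep]


def outliers_by_pair(freqs, fraction=0.01):
    """Compute Fst for every population pair and return the top tail.

    freqs: dict mapping SNP ID -> dict mapping population -> allele frequency.
    Returns a dict mapping (pop_a, pop_b) -> list of outlier SNP IDs.
    """
    pops = sorted(next(iter(freqs.values())))
    result = {}
    for a, b in combinations(pops, 2):
        scores = {snp: pairwise_fst(f[a], f[b]) for snp, f in freqs.items()}
        result[(a, b)] = top_tail(scores, fraction)
    return result
```

The tail cutoff is an empirical ranking, not a significance test: the top 1% exists even when nothing is under selection, so the flagged SNPs are candidates rather than confirmed targets.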

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.3 years ago by confusedious ▴ 490

Ram · Answer 2 · 2014-08-20

The paper Zev links to provides a very good intro to this field.

I thought I'd just add that the specific statistics you mention, Tajima's estimator (pi) and Watterson's estimator of theta, form the basis of Tajima's D.

Briefly, the idea is that if a gene has been subject to directional selection (i.e. positive or negative selection), those variants that are present will be at low frequency, so nucleotide diversity (pi) will be low relative to Watterson's theta (which is based only on the number of segregating sites), giving a negative D. A positive value for D would suggest balancing selection (maintaining an excess of medium-frequency alleles). BUT Tajima's D is also affected by demography, since population expansion also leads to an excess of rare alleles.
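
To make the relationship between pi, Watterson's theta and D concrete, here is a small Python sketch of the standard Tajima (1989) formula. It assumes you have, for each segregating biallelic site, the count of one allele among n sampled chromosomes, with the same n at every site; the function name and input layout are hypothetical, and it is not meant to replace an established tool.

```python
import math


def tajimas_d(allele_counts, n):
    """Tajima's D from per-site allele counts.

    allele_counts: count of one allele (0 < count < n) at each
                   segregating biallelic site.
    n:             number of sampled chromosomes (assumed equal at all sites).
    """
    s = len(allele_counts)
    if s == 0 or n < 4:
        # The variance term is zero for n < 4, so D is undefined there.
        raise ValueError("need at least one segregating site and n >= 4")

    # pi: average number of pairwise differences, summed over sites.
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in allele_counts)

    # Watterson's theta uses only the number of segregating sites.
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    theta_w = s / a1

    # Constants for the variance of (pi - theta_w).
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)

    return (pi - theta_w) / math.sqrt(e1 * s + e2 * s * (s - 1))
```

With ten sites that are all singletons in a sample of ten chromosomes, pi falls below theta and D comes out negative; with ten sites all at 50% frequency, D comes out positive, matching the interpretation above.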

As Zev's paper describes, there is a whole suite of measures that are more or less sensitive to different demographic and population genetic processes.

I'm not aware of a test that compares Pi_non-syn with Pi_syn, though some tests like McDonald-Kreitman include those values along with divergence stats.

Ram · Answer 3 · 2014-08-20

You could consider using the integrated haplotype score (iHS) if you have adequate data and are interested in relatively recent selection-driven change. This is a reasonably straightforward way of looking for signals of positive selection.

This has been used often in studies of recent human evolution.

See: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.0040072

If your data includes sub-populations of reasonable sizes, you could also consider using Fst to find variants that might be indicative of selection-driven differences between sub-populations.

Ram · Answer 4 · 2014-09-15

10.3 years ago

Chrispin Chaguza ▴ 280

As you probably already know, interpretation of these values can indeed be very tricky. For example, how do you know or test whether the Tajima's D estimate is significant? The Wikipedia link shows a table that provides a summary on how to interpret the results (http://en.wikipedia.org/wiki/Tajima%27s_D) and it also provides a 'rule of thumb' that suggests that values less than -2 or greater than +2 are generally significant (but do not represent critical values).

There is also a paper that provides a method for constructing critical values for Tajima's D (and similar statistics): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1206737/

ADD COMMENT • link 10.3 years ago by Chrispin Chaguza ▴ 280

Just a comment here: if you want a quick and easy way to calculate Tajima's D, then download MEGA. It's free and calculates this statistic from an alignment file, which is very simple.

You're on your own though on figuring out whether the D value is significant.

ADD REPLY • link 10.3 years ago by confusedious ▴ 490

The problem is that I have .vcf/SNP data, not alignments.

ADD REPLY • link 10.3 years ago by Adrian Pelin ★ 2.6k

Ah, well if that's the case then methods like Fst outlier analysis or integrated haplotype score would be the way to go.

ADD REPLY • link 10.3 years ago by confusedious ▴ 490

A shameless plug, but try out my GPAT suite of tools for selection: http://github.com/jewmanchue/vcflib/wiki

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.3 years ago by Zev.Kronenberg 12k

I forgot to mention that I am working on spores. Since you can't sequence single spores efficiently, my samples consist of populations of spores, so in a way each one is a pooled sample. That makes it very hard to call SNPs and phase the data.

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.3 years ago by Adrian Pelin ★ 2.6k

That does make things a bit harder. One approach you could take, though somewhat speculative, would be to calculate a diversity score of some kind at each locus and then produce a distribution of this score. Scores at the low end of the spectrum might be indicative of loci that have been under a selective sweep or purifying selection, and scores at the high end may be examples of loci under balancing selection. This isn't an iron-clad way of doing things, but you can say something about the data this way as opposed to not much.

I second Zev's recommendation to look at GPAT. I just took a look at the GitHub site and it does look very useful. I wish I had known about it back when I was doing my Master's thesis and calculating pairwise Fst at ~3,000,000 individual SNPs between three populations (I did it in R, and it took forever).
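
As a rough sketch of that idea in Python: the diversity score here is the summed expected heterozygosity, 2p(1-p), per base pair of each locus, which is only one possible choice. It assumes pooled allele frequency estimates at biallelic sites and a known locus length; the function names, the input layout and the 5% cutoff are all hypothetical.

```python
def locus_diversity(allele_freqs, length):
    """Expected heterozygosity per base pair for one locus.

    allele_freqs: pooled frequency of one allele at each variable site.
    length:       locus length in bp, so invariant sites count as zero.
    """
    if length <= 0:
        raise ValueError("locus length must be positive")
    return sum(2.0 * p * (1.0 - p) for p in allele_freqs) / length


def flag_tails(scores, fraction=0.05):
    """Split off the lowest and highest scoring loci.

    scores: dict mapping locus ID -> diversity score.
    Returns (low_tail, high_tail) as lists of locus IDs.
    """
    ranked = sorted(scores, key=scores.get)
    n_tail = max(1, int(len(ranked) * fraction))
    return ranked[:n_tail], ranked[-n_tail:]
```

Low-tail loci would be the sweep or purifying selection candidates and high-tail loci the balancing selection candidates, with the caveat that allele frequencies estimated from pooled reads are noisy and that coverage differences between loci can shift the scores.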

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.3 years ago by confusedious ▴ 490