Question

How Do High Fst Values Compare With dN/dS?

11.4 years ago

Adrian Pelin ★ 2.7k

Hello,

I calculated Fst values for a few genes in my genome pairwise between a couple of isolates.

It seems some of the genes have Fst values an order of magnitude higher than those of other genes.

Fst reflects population structure, and I read somewhere that genes with higher Fst values between subpopulations may be involved in adaptation.

How does this compare with dN/dS > 1, which also should hint at genes involved in adaptability?

Which one would be more reliable?

Thank you, Adrian

fst • 5.4k views

ADD COMMENT • link updated 11.3 years ago by a1ultima ▴ 870 • written 11.4 years ago by Adrian Pelin ★ 2.7k

score 4 · Answer 1 · 2014-02-26

11.4 years ago

David W 4.9k

Fst and dN/dS are going to tell you about slightly different things.

- Fst is about the distribution of alleles at sites that are segregating in a population (specifically, the expected loss of heterozygosity that arises because of allele frequency differences among subpopulations). An outlying Fst value might therefore tell you something about selection acting differently across different subpopulations (e.g., an allele with an increased frequency in one population due to selection).
- dN/dS is about how substitutions (i.e. fixed differences) accrue in separate lineages. It will tell you whether the changes in a given protein over evolutionary time are likely to have resulted from selection (or, in practice, how strongly constrained a sequence is).

So, dN/dS will tell you about the evolutionary history of a protein and the way it has accrued changes. Fst will tell you about the recent history of alleles at a given locus in different populations.
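
To make the "loss of heterozygosity" definition concrete, here is a small worked example for two subpopulations and two alleles. The allele frequencies (0.9 and 0.1) are made up for illustration, and equal subpopulation sizes are assumed.

```latex
% H_S: mean expected heterozygosity within subpopulations
% H_T: expected heterozygosity of the pooled (total) population
F_{ST} = \frac{H_T - H_S}{H_T}

% Illustrative frequencies of one allele: p_1 = 0.9, p_2 = 0.1
H_S = \tfrac{1}{2}\left[\, 2(0.9)(0.1) + 2(0.1)(0.9) \,\right] = 0.18
\bar{p} = \tfrac{1}{2}(0.9 + 0.1) = 0.5, \qquad H_T = 2(0.5)(0.5) = 0.5
F_{ST} = \frac{0.5 - 0.18}{0.5} = 0.64
```

If both subpopulations had the same allele frequency, H_S would equal H_T and Fst would be 0.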

ADD COMMENT • link 11.4 years ago by David W 4.9k

Oh okay. Is there a difference in calculating Fst for tetraploids vs diploids?

Also, what if you calculate Fst values for genes that have multiple copies in the genome? Such genes would have an increased number of haplotypes per individual and, as a result, greater diversity. Would it tell you anything to calculate Fst for those? Is it almost certain that the value would be higher than for any other gene?

ADD REPLY • link 11.4 years ago by Adrian Pelin ★ 2.7k

So... there is a long and often confusing literature about Fst. The original Fst was defined only for diploid data (and two-allele systems), but other statistics have been developed to generalise to other cases. Gst (Nei, 1973) will work for any ploidy. I don't have a good handle on how other statistics work with ploidy.

I would be very careful about using multi-copy genes in these sorts of studies - they'd be difficult to phase, and different copies are likely to have different histories to at least some degree.

ADD REPLY • link 11.4 years ago by David W 4.9k

Thanks a lot for your input! For Fst estimation, I used PoPoolation2. Any idea what I can use to calculate Gst from NGS data?

ADD REPLY • link 11.4 years ago by Adrian Pelin ★ 2.7k

Sorry - I'm no help there. The calculation itself is quite straightforward if you can extract estimates of allele frequencies for each population. I'm not sure how you'd handle any biases in estimating allele frequencies from NGS data, though.
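
For what it's worth, the calculation from allele frequencies can be sketched in a few lines of Python. This is a minimal sketch of the uncorrected Gst = (H_T - H_S) / H_T, assuming equal weighting of populations and no sample-size correction; the function names and the example frequencies are made up for illustration, and it does nothing about NGS-specific biases.

```python
def expected_heterozygosity(freqs):
    """Expected heterozygosity 1 - sum(p_i^2) for one set of allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def gst(pop_freqs):
    """Uncorrected Gst = (H_T - H_S) / H_T for a single locus.

    pop_freqs: one list of allele frequencies per population, all in the
    same allele order. Populations are weighted equally (an assumption).
    """
    n_pops = len(pop_freqs)
    n_alleles = len(pop_freqs[0])
    # H_S: average within-population expected heterozygosity
    h_s = sum(expected_heterozygosity(f) for f in pop_freqs) / n_pops
    # H_T: expected heterozygosity of the averaged allele frequencies
    mean_freqs = [sum(f[i] for f in pop_freqs) / n_pops for i in range(n_alleles)]
    h_t = expected_heterozygosity(mean_freqs)
    if h_t == 0:
        return 0.0  # monomorphic overall: Gst is undefined, reported as 0 here
    return (h_t - h_s) / h_t


# Hypothetical frequencies of three alleles in two populations
print(gst([[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]]))
```

Because it only needs allele frequencies, nothing in this depends on ploidy, and it handles more than two alleles.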

ADD REPLY • link 11.4 years ago by David W 4.9k

0

Well, if you do PCR + cloning you introduce a bias through exponential amplification, meaning that more frequent haplotypes will be even more over-represented while less frequent haplotypes will be under-represented.

With NGS, I am not sure what the bias is.

ADD REPLY • link 11.4 years ago by Adrian Pelin ★ 2.7k

score 0 · Answer 2 · 2014-03-11

Conclusions one may draw from dN/dS ratios (a.k.a. Ka/Ks):

Neutral evolution: dN/dS ratio = 1 implies that synonymous changes (DNA substitutions that do not affect the protein sequence) and non-synonymous changes (DNA substitutions that do affect the protein sequence) have accumulated at equal rates per site between the ancestral and the modern versions of the protein.

Positive selection (adaptive evolution): dN/dS ratio > 1 implies there have been more non-synonymous changes per site than synonymous changes. There has been evolutionary pressure to escape from the ancestral state - i.e. positive selection pressure. This can occur, for example, in paralogues that are required to serve a novel function, or in proteins of parasites that need to escape host immune recognition (e.g. changes that avoid MHC-I binding to evade T-cell attack).

Negative selection (conservation): dN/dS ratio < 1 implies there have been more synonymous changes per site than non-synonymous changes. There has been evolutionary pressure to conserve the ancestral state - i.e. negative selection pressure. This can occur, for example, in orthologues that are required to maintain (conserve) some function encoded in the protein sequence, since changes from this state would disrupt function.
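
To show what "synonymous" and "non-synonymous" counting looks like in practice, here is a rough Python sketch for a pair of aligned, in-frame coding sequences. It is a simplification in the spirit of Nei-Gojobori counting, not what any of the tools below actually run: it assumes the standard genetic code, skips codons that differ at more than one position, treats changes to stop codons as non-synonymous, and applies no multiple-hit correction. The function names and example sequences are made up.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def synonymous_sites(codon):
    """Number of synonymous sites in a codon (between 0 and 3)."""
    s = 0.0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                s += 1 / 3  # each position has 3 possible changes
    return s


def pn_ps(seq_a, seq_b):
    """Return (pN, pS): uncorrected proportions of non-synonymous and
    synonymous differences per non-synonymous / synonymous site."""
    syn_diff = nonsyn_diff = 0
    syn_sites = total_sites = 0.0
    for i in range(0, min(len(seq_a), len(seq_b)) - 2, 3):
        ca, cb = seq_a[i:i + 3].upper(), seq_b[i:i + 3].upper()
        if ca not in CODON_TABLE or cb not in CODON_TABLE:
            continue  # skip gaps and ambiguous bases
        n_diff = sum(x != y for x, y in zip(ca, cb))
        if n_diff > 1:
            continue  # would need pathway averaging; skipped in this sketch
        syn_sites += (synonymous_sites(ca) + synonymous_sites(cb)) / 2
        total_sites += 3
        if n_diff == 1:
            if CODON_TABLE[ca] == CODON_TABLE[cb]:
                syn_diff += 1
            else:
                nonsyn_diff += 1
    nonsyn_sites = total_sites - syn_sites
    p_s = syn_diff / syn_sites if syn_sites else float("nan")
    p_n = nonsyn_diff / nonsyn_sites if nonsyn_sites else float("nan")
    return p_n, p_s


# Hypothetical aligned sequences (three codons each)
p_n, p_s = pn_ps("ATGTTTGGG", "ATGTTCGGA")
print(p_n, p_s)
```

The ratio pN/pS is only a crude stand-in for dN/dS, and it is undefined when no synonymous differences are observed, which is common for very similar sequences.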

Useful Tips:

- Algorithms can run either on multiple sequences or on just a pair of sequences. In either case the input sequences used to derive a dN/dS ratio must share ancestry - too divergent and there is a problem with multiple substitutions, too recent and you will not have enough observed changes to draw conclusions from.
- dN/dS can be used to compare whole proteins or regions within proteins (a sliding dN/dS value across the protein).
- A dN/dS ratio calculated for a whole protein is often an underestimate (lower than it should be) because it averages over all the domains that make up the protein: strongly conserved regions (for instance, an alpha-helix structure that is always required in a set of proteins that otherwise perform a variety of different functions) can mask positive selection acting on a few sites.
- The only sequence changes considered are substitutions (not duplications, inversions, etc.).
- Significance of a given dN/dS ratio can be assessed using Fisher's exact test: read this
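
As a rough illustration of the Fisher's exact test idea, here is a stdlib-only Python sketch of the two-sided test on a 2x2 table. The table layout (rows: non-synonymous vs synonymous sites; columns: substituted vs unchanged) is my assumption about one reasonable way to set it up, and the counts in the example are invented.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed one
    # (small tolerance guards against floating-point ties)
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


# Hypothetical counts, rounded to integers:
# non-synonymous sites: 12 substituted, 288 unchanged
# synonymous sites:     10 substituted,  90 unchanged
print(fisher_exact_two_sided(12, 288, 10, 90))
```

A small p-value only says the two substitution proportions differ; the direction (dN/dS above or below 1) still has to be read from the ratio itself.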

Software:

Here are my recommendations for software ordered by how flexible they are:

- MATLAB's Bioinformatics Toolbox: Here you have the greatest variety of alternative algorithms, operating system compatibility, sliding vs. whole-protein analysis, an API to GenBank, etc. (Here's a great tutorial for using their dN/dS tool). Just remember MATLAB is not free.
- KaKs Calculator: If you only care about whole-protein dN/dS, many options are available with the KaKs Calculator, which also computes statistical significance using Fisher's exact test. I can also provide an R script that generates error bars from the output, just ask.
- PAML: If you have more than two sequences per protein that you wish to get a dN/dS value from, then many options are available with PAML. This is often used in published papers, but it's not recommended if you only have a pair of sequences per protein.