Question

Homologous Sequence Identity

10.9 years ago

Pappu ★ 2.1k

I am trying to figure out how well the sequences of two protein families are conserved across species. I have two FASTA files, one per family, each containing that family's sequences. I want to quantify the amino acid conservation of the sequences in each file on a scale of 0-1.

Does it make sense? I am wondering if there is any technical term which relates to what I am trying to do. Is it equivalent to the average of the dN/dS of all amino acid sites? The main objective is to identify if there is any evolutionary pressure which makes one family more conserved than the other.

updated 10.9 years ago by Asaf 10k • written 10.9 years ago by Pappu ★ 2.1k

For your clear question, +1.

updated 5.1 years ago by Ram 44k • written 10.9 years ago by a1ultima ▴ 860

score 6 · Answer 1 · 2014-03-11

dN/dS (also known as Ka/Ks) analysis does indeed provide a way to infer conservation of protein sequences. The value of dN/dS actually ranges from 0 to infinity, but it is a ratio whose null expectation is 1 (neutral evolution). If you are only interested in the conservation signal, you can focus on the dN/dS values between 0 and 1.

Conclusions one may draw from dN/dS ratios:

Neutral Protein Evolution: a dN/dS ratio of 1 implies that non-synonymous changes (DNA substitutions that alter the protein sequence) and synonymous changes (DNA substitutions that do not) have accumulated at equal rates per site between the ancestral and the modern versions of the protein.

Positive Selection (adaptive evolution): a dN/dS ratio > 1 implies that non-synonymous changes have accumulated faster than synonymous changes. There has been evolutionary pressure to escape from the ancestral state, i.e. positive selection pressure. This can occur, for example, in paralogues that are required to serve a novel function, or in proteins of parasites that need to escape host immune recognition (e.g. changes that avoid MHC class I binding to evade T-cell attack).

Negative Selection (conservation): a dN/dS ratio < 1 implies that synonymous changes have accumulated faster than non-synonymous changes. There has been evolutionary pressure to conserve the ancestral state, i.e. negative (purifying) selection pressure. This can occur, for example, in orthologues that are required to maintain (conserve) some function encoded in the protein sequence, since changes from this state would disrupt that function.
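To make the synonymous vs. non-synonymous distinction concrete, here is a minimal Python sketch that counts both kinds of change between two aligned, in-frame coding sequences. It is not a dN/dS estimator: a real estimate also normalises by the number of synonymous and non-synonymous sites and corrects for multiple substitutions. The function name `count_codon_differences` is made up for illustration, the standard genetic code is assumed, and codon pairs that differ at more than one position are simply skipped.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: nuclear code).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codon_differences(seq_a, seq_b):
    """Return (synonymous, non_synonymous) counts of single-base codon changes.

    Both inputs must be aligned, in-frame DNA strings of equal length.
    Codons with gaps or ambiguous bases, and codon pairs that differ at
    two or three positions, are ignored in this simplified sketch.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")

    synonymous = 0
    non_synonymous = 0
    for i in range(0, len(seq_a), 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue  # gap or ambiguous base
        differing_bases = sum(x != y for x, y in zip(codon_a, codon_b))
        if differing_bases != 1:
            continue  # the order of changes is ambiguous, so skip
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            non_synonymous += 1
    return synonymous, non_synonymous


# AAA -> AAG keeps lysine (synonymous); TTT -> TTA changes Phe to Leu.
print(count_codon_differences("ATGAAATTT", "ATGAAGTTA"))  # (1, 1)
```

Raw counts like these are only a first look, since most random single-base changes are non-synonymous to begin with, which is exactly why the tools listed under Software normalise by site counts.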

Useful Tips:

Algorithms can run either on multiple sequences or on just a pair of sequences. In either case the input sequences used to derive a dN/dS ratio must share ancestry: too divergent and multiple substitutions at the same site become a problem, too recent and you will not have enough observed changes to draw conclusions from.
dN/dS can be used to compare whole proteins or regions within proteins (a sliding dN/dS value across the protein).
A dN/dS ratio calculated for a whole protein is often an underestimate (lower than it should be) because each protein is made up of a variety of domains under different constraints; for instance, an alpha-helix structure may always be required in a set of proteins that perform a variety of different functions.
The only sequence changes considered are substitutions (not duplications, inversions, etc.).
Significance of a given dN/dS ratio can be assessed using Fisher's exact test: read this
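As a rough illustration of the Fisher's exact test mentioned in the last tip, here is a small pure-Python sketch of a two-sided test on a 2x2 table. The table layout (rows: synonymous vs. non-synonymous; columns: substituted vs. unchanged sites) is an assumption made for this example and may not match what a given tool does internally, and the function name `fisher_exact_two_sided` is made up.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Assumed layout for a dN/dS-style question:
        a = synonymous substitutions,      b = unchanged synonymous sites
        c = non-synonymous substitutions,  d = unchanged non-synonymous sites
    All four entries must be non-negative integers.
    """
    if min(a, b, c, d) < 0:
        raise ValueError("table entries must be non-negative")

    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def table_probability(x):
        # Hypergeometric probability of seeing x in the top-left cell
        # when the row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = table_probability(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum every table that is at least as unlikely as the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(
        table_probability(x)
        for x in range(lowest, highest + 1)
        if table_probability(x) <= observed * (1 + 1e-9)
    )


# Made-up counts: 30 of 200 synonymous sites changed vs. 20 of 600 non-synonymous sites.
p_value = fisher_exact_two_sided(30, 170, 20, 580)
print(f"p = {p_value:.3g}")
```

Counts should be whole numbers for this test, so fractional site estimates from a dN/dS method would need rounding first, which is itself an approximation.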

Software:

Here are my recommendations for software ordered by how flexible they are:

MATLAB's Bioinformatics Toolbox: Here you have the greatest variety of alternative algorithms, operating system compatibility, sliding vs. whole protein analysis, API to GenBank, etc. (Here's a great tutorial for using their dN/dS tool). Just remember MATLAB is not free.
KaKs Calculator: If you only care about whole protein dN/dS, many options are available with the KaKs Calculator, which also computes statistical significance using Fisher's exact test. I can also provide an R script that generates error bars from the output, just ask.
PAML: If you have more than two sequences per protein that you wish to get a dN/dS value from, then many options are available with PAML. This is often used in published papers, but it's not recommended if you only have a pair of sequences per protein.

score 1 · Answer 2 · 2014-03-11

I would suggest (after the detailed answer of a1ultima) a different approach, one that does not use dN/dS but instead takes a given phylogenetic tree into account. The program Rate4Site, for instance, computes the most likely substitution rate of each amino acid in the protein along the tree, yielding a rate for each position.
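If you just want a quick per-position score on a 0-1 scale before running a tree-based tool, a much cruder option is to score each column of a protein multiple sequence alignment by its Shannon entropy. To be clear, this is not what Rate4Site does: it ignores the phylogeny entirely, so closely related sequences inflate the apparent conservation. The function names below are made up for illustration, and the sequences are assumed to be already aligned.

```python
import math
from collections import Counter

MAX_ENTROPY = math.log2(20)  # entropy of a column with all 20 amino acids equally frequent


def column_conservation(aligned_seqs):
    """Score each alignment column from 0 (highly variable) to 1 (invariant).

    The score is 1 minus the column's Shannon entropy divided by log2(20).
    Gaps ('-') are ignored; an all-gap column scores None.
    """
    if not aligned_seqs:
        raise ValueError("need at least one sequence")
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")

    scores = []
    for i in range(length):
        residues = [seq[i].upper() for seq in aligned_seqs if seq[i] != "-"]
        if not residues:
            scores.append(None)
            continue
        counts = Counter(residues)
        total = len(residues)
        entropy = -sum(
            (n / total) * math.log2(n / total) for n in counts.values()
        )
        scores.append(1.0 - entropy / MAX_ENTROPY)
    return scores


def mean_conservation(aligned_seqs):
    """Average the per-column scores into one 0-1 value per family."""
    scores = [s for s in column_conservation(aligned_seqs) if s is not None]
    return sum(scores) / len(scores)


# Toy alignment: the first two columns are invariant, the third is fully variable.
family = ["ACD", "ACE", "ACF", "ACG"]
print(column_conservation(family))
print(mean_conservation(family))
```

Comparing `mean_conservation` between two families is only meaningful if both families sample a similar range of species, which is one reason a tree-aware method is the better choice for the actual analysis.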