I've noticed that most programs used to predict the effect of non-synonymous amino acids substitutions simply don't analyze the action of selection of any kind. Given that conservation in alignment does not mean necessarily purifying selection, I'm wondering about the true relation between conservation and selection. In order to test this idea, I really need some guidelines to calculate Ka/Ks in close and distantly related species. Any software able to deal with such situation is much welcome too!
-- Edit --
I know a lot of programs able to calculate Ka/Ks. What I really want is an expert opinion of what to do and which one to use.
Conclusions one may draw from dN/dS ratios (aka. ~Ka/Ks):
Neutral Evolution (drift): dN/dS ratio = 1 implies there has been equal numbers of synonymous (dna substitutions that do not affect the protein sequence) and non-synonymous changes (dna substitutions that do affect the protein sequence) during the time between ancestral to the modern versions of the protein.
Positive Selection (adaptive evolution): dN/dS ratio > 1 implies there has been more non-synonymous changes than synonymous changes. There has been evolutionary pressure to escape from the ancestral state - i.e. positive selection pressure. This can occur for example in paralogues that are required to serve a novel function, or in proteins of parasites that need to escape host immune recognition (e.g. changes to avoid MHC-1 binding to evade T-cell attack).
Negative Selection (conservation): dN/dS ratio < 1 implies there has been more synonymous changes than non-synonymous changes. There has been evolutionary pressure to conserve the ancestral state - i.e. negative selection pressure. This can occur for example in orthologues that are required to maintain (conserve) some function encoded in the protein sequence, since changes from this state would lead to disruption of function.
Useful Tips:
Algorithms can either run on multiple sequences, or just a pair of sequences. In either case the input sequences used to derive a dN/dS ratio must share ancestry - too divergent and there is a problem with multiple substitutions, too recent and you will not have sufficient enough observed changes to draw conclusions from.
dN/dS can be used to compare whole proteins or regions within proteins (a sliding dN/dS value across the protein.)
A dN/dS ratio calculated for a whole protein is often an underestimate (lower than it should be) due to the variety of domains that constitute each protein, for instance a alpha-helix structure may always be required in a set of proteins that perform a variety of different functions.
The only sequence changes considered are substitutions (not duplications, or inversions etc.)
Significance of a given dN/dS ratio can be assessed using Fishers exact test: read this.
Software:
Here are my recommendations for software ordered by how flexible they are:
MATLAB's Bioinformatics Toolbox: Here you have the greatest variety of alternative algorithms, operating system compatibility, sliding vs. whole protein analysis, API to Genbank, etc (Here's a great tutorial for using their dN/dS tool). Just remember MATLAB is not free.
KaKs Calculator: If you only care about whole protein dN/dS, many options are available with the Ka/Ks calculator - they also compute statistical significance using Fisher's exact test. Here's an R script I made to help you out, it plots a bar chart (with error bars), from the KaKs Calculator output files.
PAML: If you have >2 sequences per protein that you wish to get a dN/dS value for, then many options are available with PAML. This is often used in published papers, but it's not recommended if you only have a pair of sequences per protein.
VectorBase dN/dS app.: If you you just want to quickly visualize a sliding window dN/dS curve, without installing anything, me and a friend have finished a website, for you to do just that. It allows for simple pairwise, sliding window dN/dS analysis using the Nei-Gojobori method. Accuracy is comparable to MATLAB's Bioinformatics Toolbox, and performance is optimized for phone. So far the GUI can only accept VectorBase gene ids, but the dnds.py script itself can take any input pair of proteins. Performance is heavily optimized, e.g. a pair of 3000bp long proteins in <1s. When the site is finally hosted I'll post a link, code: https://github.com/a1ultima/hpcleap_dnds
Also.. If you wish to run dN/dS on 100-1000s of proteins me and a friend have created our own Python script that implements the the Nei-Gojobori algorithm (but uses approximate multiple-substitutions correction rather than exact) on large-scale protein sequence pair data (i.e. one pair of sequences per protein). it implements both whole-protein and sliding-window modes. Given two million base-pairs of sequence data it will finish the job in <15 seconds with ~1% error vs. MATLAB. If you ask, I can properly document it and publish on GitHub.
Thank you for the detailed answer. It helped me a lot. Could you send me the code, too, please? My email is kitaekim077@gmail.com. I will appreciate it.
Hi @a1ultima, I know I am little late to join this ticket, but did you ever make a github page for your program. If not, could you please send it to me through email, salamzader21@gmail.com Much appreciated.
Hi a1ultima, I know im very late to ask you the python script. Could you send me the link of GitHub or via email to vharshavardhanan@gmail.com. I will appreciate it. Thank you.
For all who have asked for the dN/dS Python file, it is finally on GitHub, I really do apologize for how long I have taken to get back to some of you. This Python file I am sharing is part of a bigger pipeline, but I hope I have documented it well enough now so that you can just take this part and play with it yourself:
We've made a GUI (in the form of a webpage: vg-genes.html), that works with a local Python server (start_server.py), so that means you don't have to mess around with the dnds.py script itself. I will post a URL to the GUI as soon as it is hosted, if you cannot wait, then read the README.md to set up the local python server yourself and also how to run the GUI. Though, it currently has restrictions: can only accept VectorBase gene ids as input (it needs to do a REST call to the vectorbase.org to retrieve sequence data). But later we'll try and make it more flexible. If you want to analyze pairs of protein sequences outside of VectorBase, then focus only on dnds.py (above).
For very fast computation dnds.py needs to work with changes.py, in that it imports changes.py in the same directory. But you don't need to really worry about it, since we decided the pre-pickled computations data was worth uploading to GitHub anyway, i.e. you will only ever have to read the changes.py code if you decide to delete, the pre-pickled computations.
If your pair of input sequences that are not already aligned, you will be faced with a performance bottleneck in the alignment process, a step that is necessary to calculate dN/dS via dnds.py. By default, dnds.py will try to align the inputs by importing align.py, but I have tried to write informative comments in dnds.py to help you turn this step off. The catch is that you would then need to have the inputs codon aligned prior to running dnds.py. Apart from boosted performance, another issue with keeping the align.py step running is that it needs to import an external Python module called Biopython (import Bio). If you do not want to install Biopython, then make sure you align the inputs yourself, turn off the alignment part in dnds.py, and run dnds.py.
I am not sure this message will be seen by any of those who asked for it, will this pop up in their notifications? I will for now assume it will.
Hello, thanks for your tutorial. Could you kindly send me the R code as well, and current link for KaKs calculator.
Email: ronoevan@gmail.com
Thanks,
Evans
There really is no better advice than as posted above: use PAML!
To expand a bit: PAML is the accepted standard for codon-level selection analyses and is used in a sizable proportion of publications that mention either the words 'evolution' or 'selection'. The codeml sub-program is what you'd want to use: it supports nearly everything you might want to test with respect to estimating dN/dS or detecting / quantifying purifying or positive selection in genes using codon alignments.
Common tasks performed with codeml involve:
Estimating a single dN/dS ratio (roughly equivalent to Ka/Ks) for a single alignment of coding sequences. This gives a ballpark figure for the amount of protein-level selective constraint experienced by a gene. But be warned, most genes do not evolve homogenously along their length -- different domains (extracellular vs. intracellular, for example) may experience drastically different levels of constraint.
Identifying different dN/dS ratios in different branches of a phylogenetic tree
Identifying genes with individual sites or regions under positive selection, sometimes in specific branches of a tree.
PAML and codeml are complicated (and sometimes finicky) programs, with dozens of analyses available depending on how you configure them. In order to use it well, you'll really want to either:
Read the documentation
Read one or two of Ziheng Yang's many publications about the models PAML implements and the applications it enables
Look at the examples in the PAML download package.
Regarding the 'true relation between conservation and selection': the codon models implemented by PAML are more sensible than amino acid-based measures for measuring selection pressures acting on proteins because they offer built-in 'correction' for synonymous mutation rates (as long as synonymous sites are evolving neutrally, which seems to be largely the case with smallish population sizes).
Amino-acid based measures of evolutionary conservation are not so well normalized, because the rate of amino acid change is the result of a combination of mutation rate plus selection for conservation of protein structure and function. Codon models, on the other hand, use DNA alignments of coding regions and an explicit model incorporating the genetic code to estimate the amount of natural selection for or against protein-level changes. To paraphrase Ziheng Yang's Computational Molecular Evolution: dN/dS represents the ratio between the rate at which a protein is evolving to the rate at which it would be evolving were it a non-structured, non-functional string of DNA characters.
I do use PAML already. And HyPhy too. I have a problem with distance. PAML is effective for distantly related species. HyPhy can be used to same species ration calculation when mutation rates are large. As I want to calculate it for certain human proto-oncogenes, I'm not certain on how to proceed.
The entire justification behind dN/dS calculations basically requires fixed differences between species resulting from long periods of natural selection. I'm not sure that in the case of observed mutations within proto-oncogenes or intra-species variability you will ever get sensible results from this type of a measure. I enjoyed this analysis of the issue.
If you're interested in quantifying the amount of 'functional change' resulting from certain mutations, perhaps Grantham scores would be a viable alternative...
In fact, this ratio is only applicable when a significant amount of fixations by selection occur (which depends on a lot of effects). I'm aware of Plotkin's paper. Distance is a relative measure. Most RNA viral populations show intraspecific distances much higher than any large group of vertebrates. So, it's feasible to use it in their case. But, the question how much distance you need to call lineages distant? I wasn't aware of Grantham matrices. Anyway, the are quite similar to SDM/Andante approach (which I already use).
We present POTION, an open source, modular and end-to-end software for genome-scale detection of positive Darwinian selection in groups of homologous coding sequences. Our software represents a key step towards genome-scale, automated detection of positive selection, from predicted coding sequences and their homology relationships to high-quality groups of positively selected genes. POTION reduces false positives through several sophisticated sequence and group filters based on numeric, phylogenetic, quality and conservation criteria to remove spurious data and through multiple hypothesis corrections, and considerably reduces computation time thanks to a parallelized design.
Internally it runs PAML's codeml together with a few other tools. I've tested it with one test data-set and it's much nicer to use compared to running codeml directly.
This is a legacy tool with limited compute power attached. It may -or may not- work fine for small datasets but we do not recommend its use beyond a handful of sequences. I recommend using a Galaxy version of codeML instead.
Can we do dN/dS or Ka/ Ks for multiple genome , as I hve read generaly people were doing for pair genome, If we can do for multiple genome how we will do, do w ehave to iterate / loop in programming or we have to go manually, next can w edo dN/dS or Ka/Ks within different genus for positive selection, as I am having 20 different genus or even different family, so can we do this Ka/Ks within different family group of bacteria if so can we do in one go and how, or I have to do manually,
There is also Galaxy tool if you don't want to install the PAML software locally: KaKsAnalysis available via https://usegalaxy.org
We also have the codeML tool accessible via https://usegalaxy.no
Also.. If you wish to run dN/dS on 100-1000s of proteins me and a friend have created our own Python script that implements the the Nei-Gojobori algorithm (but uses approximate multiple-substitutions correction rather than exact) on large-scale protein sequence pair data (i.e. one pair of sequences per protein). it implements both whole-protein and sliding-window modes. Given two million base-pairs of sequence data it will finish the job in <15 seconds with ~1% error vs. MATLAB. If you ask, I can properly document it and publish on GitHub.
https://github.com/a1ultima/hpcleap_dnds
Specifically you need:
For more details, either read this, or see the comment below.
Hi, can you send me that script too? Or publish on Github
Hi! Have you published the python script on GitHub? Or could you please send me the python? Thank you so much!
Thank you for the detailed answer. It helped me a lot. Could you send me the code, too, please? My email is kitaekim077@gmail.com. I will appreciate it.
Hi @a1ultima, I know I am little late to join this ticket, but did you ever make a github page for your program. If not, could you please send it to me through email, salamzader21@gmail.com Much appreciated.
Hi a1ultima, I know im very late to ask you the python script. Could you send me the link of GitHub or via email to vharshavardhanan@gmail.com. I will appreciate it. Thank you.
For all who have asked for the dN/dS Python file, it is finally on GitHub, I really do apologize for how long I have taken to get back to some of you. This Python file I am sharing is part of a bigger pipeline, but I hope I have documented it well enough now so that you can just take this part and play with it yourself:
https://github.com/a1ultima/hpcleap_dnds
Specifically you need:
Optionally:
We've made a GUI (in the form of a webpage: vg-genes.html), that works with a local Python server (start_server.py), so that means you don't have to mess around with the dnds.py script itself. I will post a URL to the GUI as soon as it is hosted, if you cannot wait, then read the README.md to set up the local python server yourself and also how to run the GUI. Though, it currently has restrictions: can only accept VectorBase gene ids as input (it needs to do a REST call to the vectorbase.org to retrieve sequence data). But later we'll try and make it more flexible. If you want to analyze pairs of protein sequences outside of VectorBase, then focus only on dnds.py (above).
For very fast computation dnds.py needs to work with changes.py, in that it imports changes.py in the same directory. But you don't need to really worry about it, since we decided the pre-pickled computations data was worth uploading to GitHub anyway, i.e. you will only ever have to read the changes.py code if you decide to delete, the pre-pickled computations.
If your pair of input sequences that are not already aligned, you will be faced with a performance bottleneck in the alignment process, a step that is necessary to calculate dN/dS via dnds.py. By default, dnds.py will try to align the inputs by importing align.py, but I have tried to write informative comments in dnds.py to help you turn this step off. The catch is that you would then need to have the inputs codon aligned prior to running dnds.py. Apart from boosted performance, another issue with keeping the align.py step running is that it needs to import an external Python module called Biopython (
import Bio
). If you do not want to install Biopython, then make sure you align the inputs yourself, turn off the alignment part in dnds.py, and run dnds.py.I am not sure this message will be seen by any of those who asked for it, will this pop up in their notifications? I will for now assume it will.
Hi a1ultima,
Could you please send me the R script for error rate calculation for KaKs Calculator? My email id is rpraveenkumarbio@gmail.com
Hello, thanks for your tutorial. Could you kindly send me the R code as well, and current link for KaKs calculator. Email: ronoevan@gmail.com Thanks, Evans
I am really sorry about the embarassingly long wait before giving the R script for calculating error-bars for your dN/dS analyses.
For anybody who still needs it, I've put the script on GitHub with instructions on what you need to do to get it to work in the README.md:
https://github.com/a1ultima/kaks_error_bars
Assalm o alikum everyone,
I'm using ka ks calculator for ka ks ratio. But i'm too confused either it is working for multiple sequence or only for a pair of sequences ????
Hi Assalm, I've only ever got it to work with pairs