Hi, I am trying to calculate the CAI (codon adaptation index) for a list of genes of interest. I decided to use the RSCU values from the Codon Bias Database (CBDB) for the relevant organism (B. subtilis 168): http://homepages.luc.edu/~cputonti/cbdb/genera/bacillus.html#s30 .
I wrote a small Python script to do so. I couldn't decide between two implementations, so I tested them both: one is from Biopython SeqUtils (module CodonUsage, method cai_for_gene) and the other is from the 'CAI' package (method CAI) (https://pypi.python.org/pypi/CAI). Since I was using the same reference RSCU values, and both claim to implement the method from "Sharp & Li, 1987, NAR (15)", I expected both packages to return the same value for each gene. This did not happen, and the difference is obvious at first sight: the results from SeqUtils do not all fall between 0 and 1 (some are >1), which contradicts the definition of CAI, while the results from the 'CAI' package stay within that range.
I investigated further by regenerating the RSCU values myself and by reading both source codes. The first thing I noticed is that, from the same FASTA file (i.e. the same collection of genes, also downloaded from CBDB), the two packages gave me two different sets of RSCU values: the ones generated by CAI matched the values posted online, while the ones generated by SeqUtils did not. I therefore assumed that the dictionary SeqUtils expects does not contain RSCU values, so I used the SeqUtils-generated values to calculate the CAI for my genes, but the results were still outside the allowed range. Looking at the source code, I then noticed that the SeqUtils implementation differs substantially from the paper by Sharp and Li, unless it is a mathematical equivalence that I do not know or understand. The approach implemented in the CAI package, on the other hand, is identical to the one in the paper.
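For reference, here is a minimal sketch of how I read the Sharp and Li definition, written from scratch so it is not the code of either package. The function names (`split_codons`, `rscu`, `relative_adaptiveness`, `cai`) and the toy sequences are made up for illustration. Two choices are my assumptions: zero counts inside an observed codon family are replaced by 0.5 (which is how I understand the paper's suggestion), and codons for Met, Trp, stop, or families never seen in the reference are skipped. Because every weight w is an RSCU divided by the maximum RSCU of its family, w lies in (0, 1], and so does their geometric mean.

```python
import math
from collections import Counter

# Standard/bacterial amino acid assignments in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
# Group synonymous codons by the amino acid they encode.
FAMILIES = {}
for codon, aa in CODON_TABLE.items():
    FAMILIES.setdefault(aa, []).append(codon)


def split_codons(seq):
    """Split a coding sequence into complete codons, dropping a trailing partial one."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def rscu(reference_seqs):
    """RSCU = observed codon count / mean count over its synonymous family."""
    counts = Counter(c for s in reference_seqs for c in split_codons(s))
    values = {}
    for aa, family in FAMILIES.items():
        if aa == "*" or len(family) == 1:
            continue  # stop codons, Met and Trp carry no synonymous choice
        observed = [counts[c] for c in family]
        if sum(observed) == 0:
            continue  # assumption: ignore families absent from the reference
        # Assumption: replace zero counts by 0.5 so that no weight becomes 0.
        adjusted = [x if x > 0 else 0.5 for x in observed]
        mean = sum(adjusted) / len(family)
        for c, x in zip(family, adjusted):
            values[c] = x / mean
    return values


def relative_adaptiveness(rscu_values):
    """w = RSCU of a codon / highest RSCU within the same synonymous family."""
    weights = {}
    for family in FAMILIES.values():
        present = [c for c in family if c in rscu_values]
        if not present:
            continue
        best = max(rscu_values[c] for c in present)
        for c in present:
            weights[c] = rscu_values[c] / best
    return weights


def cai(seq, weights):
    """CAI = geometric mean of w over the codons of the gene that have a weight."""
    logs = [math.log(weights[c]) for c in split_codons(seq) if c in weights]
    if not logs:
        raise ValueError("no codon of the sequence has a weight")
    return math.exp(sum(logs) / len(logs))


if __name__ == "__main__":
    toy_reference = ["GCTGCTGCTGCCAAAAAAAAG"]  # made-up reference set
    w = relative_adaptiveness(rscu(toy_reference))
    print(cai("GCTAAG", w))
```

With weights built this way, a value above 1 can only appear if the numbers passed in as weights are not w values, for example raw RSCU values (which can exceed 1) used directly in the geometric mean.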
Given this, I assume the CAI package is returning the correct values.
My question is thus: what is SeqUtils calculating, then? I could not find much information about this module. Could there be an error? Why do the two algorithms differ? And why does the one in SeqUtils also differ from the paper? Is anyone able to explain this?
P.S. I am of course assuming that I used both methods correctly. Since I followed the relevant guides, I feel reasonably safe in assuming so.
Benjamin,
Could you let us know what was wrong with the CAI calculation in SeqUtils?
Thanks!
Hi, I am trying to estimate the CAI for some genes of interest using the CAI package, but I am getting an error. Could you please provide some insight into how you were able to calculate the CAI?
Thanks a lot