I have a data file containing 20 proteins, 10 in each of two classes. My goal is to find the similarities among the 10 sequences within each class, and between the two classes, at both the protein level and the RNA level. To do this, I back-translated the proteins into RNA sequences and saved them in a separate file. I want to analyse both files (protein and RNA) to find regions of similarity that the RNA sequences share within each class, and then repeat the same analysis for the proteins. As a first step, I ran a local multiple sequence alignment of the 10 RNA sequences with MUSCLE, using Jalview for this purpose. Jalview shows some coloured regions that share the same percentage identity.
My question is:
I would like to represent these colours as numbers, but Jalview did not give me any score or percentage identity values. How can I analyse my data to extract the similarities among the sequences for both classes and at both levels (RNA and protein)? And how can I express them as a quantitative measurement (e.g. similarity score, level of variation, percentage identity)? Suggestions for other types of measurement are also welcome.
d. Calculating the amino acid conservation scores
The conservation score at a site corresponds to the site's evolutionary rate. The rate of evolution is not constant among amino (nucleic) acid sites: some positions evolve slowly and are commonly referred to as "conserved", while others evolve rapidly and are referred to as "variable". The rate variations correspond to different levels of purifying selection acting on these sites. The purifying selection can be the result of geometrical constraints on the folding of the protein into its 3D structure, constraints at amino acid sites involved in enzymatic activity or in ligand binding or, alternatively, at amino acid sites that take part in protein-protein interactions.
In ConSurf, the rate of evolution at each site is calculated using either the empirical Bayesian [11] or the Maximum Likelihood [12] paradigm. In both of these methods, the stochastic process underlying the sequence evolution and the phylogenetic tree are explicitly taken into account. The Bayesian method was shown to significantly improve the accuracy of conservation score estimates over the Maximum Likelihood method, in particular when a small number of sequences are used for the calculations [11]. An additional advantage of the Bayesian method is that a confidence interval is assigned to each of the inferred evolutionary conservation scores.
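To get a rough feel for what a per-site conservation number looks like before running ConSurf, here is a minimal sketch that scores each alignment column by normalised Shannon entropy. This is NOT ConSurf's method: it ignores the phylogenetic tree and the evolutionary model entirely, so treat it only as a quick approximation. The function name `column_conservation`, the `alphabet_size` parameter and the toy sequences are my own assumptions for illustration.

```python
import math
from collections import Counter


def column_conservation(aligned_seqs, alphabet_size=20, gap="-"):
    """Score each alignment column from 0.0 (variable) to 1.0 (fully conserved).

    Simple entropy-based stand-in, not ConSurf's rate-based score.
    Use alphabet_size=20 for proteins and alphabet_size=4 for RNA/DNA.
    Columns that contain only gaps get None.
    """
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")

    max_entropy = math.log2(alphabet_size)
    scores = []
    for column in zip(*aligned_seqs):
        residues = [c.upper() for c in column if c != gap]
        if not residues:
            scores.append(None)  # nothing to score in a gap-only column
            continue
        total = len(residues)
        counts = Counter(residues)
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        scores.append(1.0 - entropy / max_entropy)
    return scores


if __name__ == "__main__":
    # Hypothetical toy alignment, just to show the call.
    toy = ["MKTAY", "MKSAY", "MRTA-", "MKTGY"]
    for position, score in enumerate(column_conservation(toy), start=1):
        print(position, score)
```

Gaps are simply skipped here, which is one of several possible conventions; a column with many gaps can therefore look more conserved than it really is.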
To calculate similarities BETWEEN the two classes on the protein level:
1. Build a multiple sequence alignment separately for each class.
2. Compare the alignments using profile-profile or HMM-HMM comparison methods (e.g. HHalign, COMPASS or COMA). Some of them will give you positional scores in one form or another (HHalign reports them as discrete values, described in the HH-suite user guide).
To find conserved positions WITHIN each class on the protein level, use ConSurf, as @zev.kronenberg suggested.
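If you also want a plain percentage identity to put next to the Jalview colours, it can be computed directly from the aligned sequences. Below is a minimal sketch, assuming the sequences have already been aligned (for the between-class number, all 20 sequences would need to be in one joint alignment so that columns correspond). The function names are hypothetical, and counting only columns where neither sequence has a gap is just one of several common conventions for the denominator.

```python
from itertools import combinations, product


def percent_identity(a, b, gap="-"):
    """Percent identity of two aligned sequences.

    Columns where either sequence has a gap are ignored (an assumed convention).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == gap or y == gap:
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def mean_identity_within(seqs):
    """Average percent identity over all pairs inside one class."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(percent_identity(a, b) for a, b in pairs) / len(pairs)


def mean_identity_between(class_a, class_b):
    """Average percent identity over all cross-class pairs."""
    pairs = list(product(class_a, class_b))
    if not pairs:
        raise ValueError("both classes must contain sequences")
    return sum(percent_identity(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical toy data: two tiny "classes" taken from one joint alignment.
    class_1 = ["ACGUAC", "ACGUAA", "ACG-AC"]
    class_2 = ["UCGUGC", "UCGAGC"]
    print("within class 1:", mean_identity_within(class_1))
    print("within class 2:", mean_identity_within(class_2))
    print("between classes:", mean_identity_between(class_1, class_2))
```

The same functions work for both the protein and the RNA alignments, since they only compare characters column by column.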