Hello, I am a recent college grad doing my first bioinformatics work as part of a larger research project in my lab (not a bioinformatics-focused lab). I am trying to identify the conserved portions of a protein degradation tag that is covalently attached to proteins via tmRNA. The tag is encoded in the tmRNA using the normal codon-to-amino-acid rules, and I have been able to identify the tag sequence in most of the 65 sequenced genomes of the phylum of interest.
My main difficulty is how to interpret this data to conclude which positions in the protein tag are the most highly conserved. My first inclination would be to simply find the distribution of amino acids at each position in the tag, counting positions backwards from the stop codon at the end of the tag. I see two major faults in this method, though.
- It does not account for insertions/deletions, which shift residues into different positions. A residue that is actually conserved would then not be identified as such by the above method.
- It is biased towards the motifs of organisms that have had multiple strains sequenced.
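To make the naive approach concrete, here is a minimal Python sketch of what I mean. The tag sequences are toy placeholders (not real data from my genomes) and the function name `right_aligned_counts` is just my own label. It counts residues per position starting from the C-terminal end, so it shows both faults: the shorter tag shifts out of register, and the duplicated tag counts twice.

```python
from collections import Counter

def right_aligned_counts(tags):
    """Count residues at each position, indexed from the C-terminal end.

    Index 0 of the result is the last residue before the stop codon.
    """
    longest = max(len(t) for t in tags)
    columns = []
    for offset in range(1, longest + 1):
        # Only tags long enough to reach this offset contribute.
        residues = [t[-offset] for t in tags if len(t) >= offset]
        columns.append(Counter(residues))
    return columns

# Toy tag sequences for illustration only (assumed, not real data).
example_tags = ["ANDENYALAA", "ANDENYALAA", "ANDETYALAA", "GNNYALAA"]
counts = right_aligned_counts(example_tags)

for pos, col in enumerate(counts, start=1):
    print(pos, dict(col))
```

No gaps are introduced here, so any indel upstream of the stop codon puts residues in the wrong column.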
So my question is: how do I identify conserved elements within a motif? Is there a standard method or program that is used for this?
My only current thought for an improvement is to address the second point by weighting the sequences by how phylogenetically dissimilar they are to the others. A slightly more comprehensive graphic than the one in panel b of this figure (paper) is similar to what I am looking for. Thanks!
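For concreteness, the weighting idea could look something like the sketch below. Two loud assumptions: the tags are already aligned to equal length (with `-` for gaps, treated here as just another symbol), and I am using position-based sequence weights in the style of Henikoff and Henikoff as a stand-in for true phylogenetic weighting, which would need a tree. The toy sequences and the function names are mine, not real data or an existing library API.

```python
import math
from collections import Counter, defaultdict

def position_based_weights(aligned):
    """Henikoff-style weights: sequences with many near-duplicates get less weight."""
    weights = [0.0] * len(aligned)
    for c in range(len(aligned[0])):
        column = [seq[c] for seq in aligned]
        counts = Counter(column)
        k = len(counts)  # number of distinct symbols in this column
        for i, residue in enumerate(column):
            weights[i] += 1.0 / (k * counts[residue])
    total = sum(weights)
    return [w / total for w in weights]  # normalize so weights sum to 1

def weighted_entropy(aligned, weights):
    """Shannon entropy in bits per column; lower means more conserved."""
    entropies = []
    for c in range(len(aligned[0])):
        freq = defaultdict(float)
        for seq, w in zip(aligned, weights):
            freq[seq[c]] += w
        entropies.append(-sum(p * math.log2(p) for p in freq.values() if p > 0))
    return entropies

# Toy alignment (assumed): three identical "strains" plus one divergent tag.
aligned_tags = ["ANDENYALAA", "ANDENYALAA", "ANDENYALAA", "GNDETYALAA"]
weights = position_based_weights(aligned_tags)
entropies = weighted_entropy(aligned_tags, weights)

for pos, h in enumerate(entropies, start=1):
    print(pos, round(h, 3))
```

With these weights the divergent tag counts for more than any single copy of the repeated one, so columns where it differs score as less conserved than a plain unweighted count would suggest.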