I have several similar protein sequences, for example:
>a
GKGGGIGGGIGKGGG
>b
GKGGGIGGGIGKGGGIGGG
>c
GKGGGIGGGIGKGGGIGGGI
>d
GKGGGVGGGIGKGGG
Then I aligned them with Clustal and got the following result:
CLUSTAL 2.1 multiple sequence alignment
b  GKGGGIGGGIGKGGGIGGG-
c  GKGGGIGGGIGKGGGIGGGI
a  GKGGGIGGGIGKGGG-----
d  GKGGGVGGGIGKGGG-----
   *****:*********
I wonder how I can use a single sequence to represent these four sequences with no, or minimal, loss of information.
Previously I tried HMMER, which uses a hidden Markov model to build a profile of the sequences. The result is stored as a matrix-like model.
While I was typing the title, the Biostar system recommended the question "Score protein variants based on frequency of AA in multiple sequence alignment", whose solution is similar to HMMER.
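To make the frequency-based idea concrete, here is a minimal sketch that turns the Clustal alignment above into a per-column frequency table. It is only an illustration: it is not HMMER's actual model (there are no pseudocounts, insert states or transition probabilities), and the names `alignment` and `column_frequencies` are made up for this example.

```python
from collections import Counter

# Aligned sequences copied from the Clustal output above ('-' marks a gap).
alignment = {
    "b": "GKGGGIGGGIGKGGGIGGG-",
    "c": "GKGGGIGGGIGKGGGIGGGI",
    "a": "GKGGGIGGGIGKGGG-----",
    "d": "GKGGGVGGGIGKGGG-----",
}


def column_frequencies(aligned):
    """Return one dict per alignment column mapping symbol -> relative frequency.

    Hypothetical helper for illustration; gaps are counted as a symbol of their own.
    """
    seqs = list(aligned.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*seqs):
        counts = Counter(column)
        profile.append({sym: n / len(seqs) for sym, n in counts.items()})
    return profile


profile = column_frequencies(alignment)

# Print only the columns that are not perfectly conserved.
for position, freqs in enumerate(profile, start=1):
    if len(freqs) > 1:
        print(position, freqs)
```

Unlike a single consensus string, this table keeps the minority residues (the V in column 6) and the gap fractions, and it is plain data, so it can be produced in batch.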
WebLogo can also give me a picture showing the sequence motif, but I think it loses information, and a picture is not suitable for batch processing.
In the paper "BH3-only proteins in apoptosis and beyond: an overview", I saw the picture below.
It uses special characters to represent similar amino acids.
Until I find a more suitable representation, I think this kind of result is what I want.
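A rough sketch of how such a class-symbol consensus could be generated from the alignment is shown here. Everything in it is an assumption made for illustration: the residue groupings, the symbols (`@`, `&`, `+`, `~`, `x`) and the parentheses used to flag columns that contain gaps are hypothetical and are not the notation used in the paper.

```python
# Hypothetical residue classes; symbols and groupings are illustrative only.
RESIDUE_CLASSES = {
    "@": set("AILMV"),  # aliphatic / hydrophobic
    "&": set("FWY"),    # aromatic
    "+": set("HKR"),    # basic
    "~": set("DE"),     # acidic
}


def class_consensus(aligned_seqs):
    """Collapse an alignment into one string.

    Conserved column        -> the residue itself
    All residues in a class -> that class symbol
    Anything else           -> 'x'
    Column containing gaps  -> symbol wrapped in parentheses
    """
    out = []
    for column in zip(*aligned_seqs):
        residues = set(column) - {"-"}
        if not residues:
            continue  # column made only of gaps
        if len(residues) == 1:
            symbol = next(iter(residues))
        else:
            symbol = next(
                (s for s, members in RESIDUE_CLASSES.items() if residues <= members),
                "x",
            )
        out.append(f"({symbol})" if "-" in column else symbol)
    return "".join(out)


aligned = [
    "GKGGGIGGGIGKGGGIGGG-",
    "GKGGGIGGGIGKGGGIGGGI",
    "GKGGGIGGGIGKGGG-----",
    "GKGGGVGGGIGKGGG-----",
]
consensus = class_consensus(aligned)
print(consensus)
```

This is still lossy: it records that position 6 is aliphatic, but not that I occurs three times and V once.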
So, could you recommend some tools for this?
Thank you!
I like the representation of motifs as regular expressions: http://elm.eu.org/help.html#nomenclature (but I don't know of an automated conversion tool). I find special characters very annoying as a reader of a paper, because I have to go back and forth between the legend and the sequence.
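For a small gapped alignment like the one in the question, a naive conversion can be sketched in a few lines of Python. This is not an existing tool and it does not implement the full ELM nomenclature; the function name `alignment_to_regex` and the rule "any column containing a gap becomes optional" are assumptions made for this example.

```python
import re


def alignment_to_regex(aligned_seqs):
    """Build a regular expression from aligned sequences, column by column.

    One residue in a column    -> that letter
    Several residues           -> a character class such as [IV]
    Column that contains a gap -> the token is made optional with '?'
    """
    parts = []
    for column in zip(*aligned_seqs):
        residues = sorted(set(column) - {"-"})
        if not residues:
            continue  # column made only of gaps
        token = residues[0] if len(residues) == 1 else "[" + "".join(residues) + "]"
        if "-" in column:
            token += "?"
        parts.append(token)
    return "".join(parts)


aligned = [
    "GKGGGIGGGIGKGGGIGGG-",
    "GKGGGIGGGIGKGGGIGGGI",
    "GKGGGIGGGIGKGGG-----",
    "GKGGGVGGGIGKGGG-----",
]
pattern = alignment_to_regex(aligned)
print(pattern)

# Every input sequence (with gaps removed) should match its own pattern.
for seq in aligned:
    assert re.fullmatch(pattern, seq.replace("-", ""))
```

Note that the pattern also accepts sequences that were never in the input, for example the V variant with the full-length tail, so it is more permissive than the four sequences it was built from.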
Michael, regular expressions for motifs work only sometimes. They form the basis of motif databases such as PROSITE and ELM. However, many real motifs show small variations from the consensus that would violate a regular expression, unless you formulate it inclusively enough to catch all instances. That, in turn, produces lots of false positives. Unless a motif has a very strict consensus requirement, it is difficult to describe with a regular expression.
Thank you! I have thought about regular expressions, and they work well sometimes. But it is hard to construct many regular expressions, and the information loss is also serious. I agree with what Lyco said. Thanks for the hint.