How Can I Profile A Multiple Alignment Result To Get A Logo Sequence To Represent All Aligned Sequences?
13.4 years ago
Ct586 ▴ 630

I have several similar protein sequences, such as:

>a
GKGGGIGGGIGKGGG
>b
GKGGGIGGGIGKGGGIGGG
>c
GKGGGIGGGIGKGGGIGGGI
>d
GKGGGVGGGIGKGGG

Then I aligned them with Clustal and got the following result:

CLUSTAL 2.1 multiple sequence alignment


b               GKGGGIGGGIGKGGGIGGG-
c               GKGGGIGGGIGKGGGIGGGI
a               GKGGGIGGGIGKGGG-----
d               GKGGGVGGGIGKGGG-----
                *****:*********

I wonder how I can use a single sequence to represent these four sequences with no, or at least minimal, loss of information.

Previously, I tried HMMER, which uses a hidden Markov model to profile the sequences. The result is stored as a matrix model.

When I typed the title, Biostar recommended the question "Score protein variants based on frequency of AA in multiple sequence alignment", whose solution is similar to HMMER.

WebLogo can also give me a picture showing the sequence motif, but I think it loses information, and a picture is not suitable for batch processing.

I saw the picture below in the paper "BH3-only proteins in apoptosis and beyond: an overview".

[Image: consensus display from the BH3 paper; not available]

It uses special characters to represent similar amino acids.

Until I find a more suitable representation, I think this kind of result is what I want.

So can you recommend some tools for this?

Thank you!

motif sequence • 4.9k views

I like the representation of motifs as regular expressions: http://elm.eu.org/help.html#nomenclature (but I don't know of an automated conversion tool). As a reader of a paper, I find special characters very annoying because I have to go back and forth between the legend and the sequence.


Michael, regular expressions for motifs work only sometimes. They form the basis of motif databases such as PROSITE and ELM. However, many real motifs show small variations from the consensus that would violate a regular expression, unless you formulate it very inclusively to catch all instances. That, in turn, means lots of false positives. Unless motifs have a very strict consensus requirement, they are difficult to treat with regular expressions.
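To make the trade-off concrete, here is a minimal Python sketch that turns the Clustal alignment from the question into a regular expression. The function name `alignment_to_regex` and the rule "a column containing a gap becomes optional" are my own assumptions for illustration, not an established conversion tool.

```python
import re

# Aligned sequences copied from the Clustal output in the question.
aligned = [
    "GKGGGIGGGIGKGGGIGGG-",
    "GKGGGIGGGIGKGGGIGGGI",
    "GKGGGIGGGIGKGGG-----",
    "GKGGGVGGGIGKGGG-----",
]

def alignment_to_regex(aligned_seqs):
    """Hypothetical helper: one regex token per alignment column."""
    tokens = []
    for col in zip(*aligned_seqs):
        residues = sorted(set(col) - {"-"})
        if not residues:
            continue  # column made of gaps only
        # A single residue is emitted as-is, several become a character class.
        token = residues[0] if len(residues) == 1 else "[" + "".join(residues) + "]"
        # Assumption: a column with at least one gap is made optional.
        if "-" in col:
            token += "?"
        tokens.append(token)
    return "".join(tokens)

pattern = alignment_to_regex(aligned)
motif = re.compile(pattern)
print(pattern)

# Every input sequence (with gaps removed) matches the pattern.
for seq in aligned:
    ungapped = seq.replace("-", "")
    print(ungapped, bool(motif.fullmatch(ungapped)))
```

The pattern also accepts strings that never occur in the alignment, for example `GKGGGVGGGIGKGGGGGG`, and it no longer records that I is seen three times and V only once at position 6. That is the inclusiveness and information loss described above.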


Thank you! I have thought about regular expressions, and they work well sometimes. But it is hard to construct many regular expressions, and the information loss is also serious. I agree with what Lyco said. Thanks for your hint.

13.4 years ago
Lyco ★ 2.3k

The kind of consensus display used in the BH3 paper is associated with a substantial loss of information, as the fancy Greek symbols only represent 'majority votes', neglecting minority observations. Moreover, the symbols are not standardized: the only special symbol most authors agree on is the uppercase Phi for hydrophobics.

In fact, sequence logos lose much less information than consensus sequences, and they can be generated in batch mode, too: there is a program called seqlogo which can be downloaded from the WebLogo pages. The major disadvantage is that the logos are bitmap images and cannot be used as simple text items. If this isn't a problem for you, I would recommend sequence logos over consensus sequence display.

The least information is lost when using frequency tables, basically two-dimensional matrices showing how often each residue is observed at each position. On the downside, they are cumbersome for display in papers.
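As a rough illustration of such a frequency table, the following Python sketch counts residues per column for the four aligned sequences from the question. The helper name `column_counts` is made up for this example, and gaps are simply counted as their own symbol, which is one possible choice among several.

```python
from collections import Counter

# Aligned sequences copied from the Clustal output in the question.
alignment = {
    "b": "GKGGGIGGGIGKGGGIGGG-",
    "c": "GKGGGIGGGIGKGGGIGGGI",
    "a": "GKGGGIGGGIGKGGG-----",
    "d": "GKGGGVGGGIGKGGG-----",
}

def column_counts(aligned_seqs):
    """Hypothetical helper: one Counter per column, gaps counted as '-'."""
    seqs = list(aligned_seqs)
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    return [Counter(col) for col in zip(*seqs)]

counts = column_counts(alignment.values())

# Print the table: one row per alignment position, one column per symbol.
symbols = sorted({sym for c in counts for sym in c})
print("pos " + " ".join(f"{sym:>2}" for sym in symbols))
for pos, c in enumerate(counts, start=1):
    print(f"{pos:>3} " + " ".join(f"{c.get(sym, 0):>2}" for sym in symbols))
```

Unlike a consensus string, the table keeps the minority observation at position 6 (three I, one V) and shows how many sequences actually cover the last five positions.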

By the way, don't trust the BH3 consensus provided in the paper. At least a third of the sequences shown are not genuine BH3 motifs.


Agreed. If you want to visually capture the information to show to an audience in a paper, a sequence logo retains the most information. If you want to actually DO something computationally, you're better off with a PSSM, a Markov model (HMMER), or similar.
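For the computational route, here is a small Python sketch of a log-odds PSSM built from the alignment in the question. Several things are assumptions on my side: a uniform background of 1/20 per amino acid, a pseudocount of 1, and the simplification of skipping every column that contains a gap. The names `build_pssm` and `score` are hypothetical. HMMER does something considerably more sophisticated.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Aligned sequences copied from the Clustal output in the question.
aligned = [
    "GKGGGIGGGIGKGGGIGGG-",
    "GKGGGIGGGIGKGGGIGGGI",
    "GKGGGIGGGIGKGGG-----",
    "GKGGGVGGGIGKGGG-----",
]

def build_pssm(aligned_seqs, pseudocount=1.0):
    """Hypothetical helper: log2-odds scores per ungapped column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in zip(*aligned_seqs):
        if "-" in col:
            continue  # simplification: ignore columns containing gaps
        total = len(col) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2((col.count(aa) + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score(pssm, segment):
    """Sum the column scores for a segment as long as the PSSM."""
    if len(segment) != len(pssm):
        raise ValueError("segment length must equal the PSSM length")
    return sum(col[aa] for col, aa in zip(pssm, segment))

pssm = build_pssm(aligned)
for segment in ("GKGGGIGGGIGKGGG", "GKGGGVGGGIGKGGG", "AAAAAAAAAAAAAAA"):
    print(segment, round(score(pssm, segment), 2))
```

The segment with I at position 6 scores higher than the one with V, and both score far above an unrelated segment, so the minority observation is kept as a graded score rather than being discarded.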


Thank you! I think this is the best strategy here. I will use an HMM profile, which contains the most information, as the motif for the searching part, and WebLogo to represent the consensus visually.


Hello. My answer may be too late, but indeed, HMMER + WebLogo is a good combination to 1/ capture the information contained in a batch of related sequences and 2/ represent the amino acid characteristics of these proteins. We did this last year in this paper: http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0009990
