Question

How To Calculate The Per Column Sequence Similarities Between The Already Aligned Sequence Profiles?

1

14.0 years ago

Jan Kosinski ★ 1.6k

I have a profile-profile alignment and I want to get the per column sequence similarity. That is, for every aligned column pair I need to get a score.

Do you know any program or programming library that can take already aligned profiles and calculate the per column scores (using any reasonable scoring function)?

alignment sequence • 6.1k views

ADD COMMENT • link updated 14.0 years ago by Bilouweb ★ 1.1k • written 14.0 years ago by Jan Kosinski ★ 1.6k

0

Do you have the alignment in FASTA format? If so, you can use MstatX.

ADD REPLY • link 14.0 years ago by Bilouweb ★ 1.1k

0

Isn't COMPASS downloadable as a standalone program? http://www.soton.ac.uk/~re1u06/software/compass/index.html

ADD REPLY • link 14.0 years ago by Jeremy Leipzig 23k

0

I can convert it to any format, and MstatX looks cool, but again, it calculates scores for ONE multiple sequence alignment. I have TWO, aligned to each other as profile to profile (a definition of profile-profile alignment is given e.g. here: http://phylogenomics.berkeley.edu/profilealignment/).

ADD REPLY • link 14.0 years ago by Jan Kosinski ★ 1.6k

0

Many programs for profile-profile or HMM-HMM comparison, like COMPASS, are downloadable, but I cannot find any that 1) takes an existing alignment of two profiles instead of aligning them itself, and 2) returns the local (per-column) scores.

ADD REPLY • link 14.0 years ago by Jan Kosinski ★ 1.6k

0

Jeremy, why has it been tagged with "sequence-logo"? I am not looking for sequence logos for profiles, and there is already a tool for that: http://www.sanger.ac.uk/cgi-bin/software/analysis/logomat-p.cgi

And what does the "PWM" tag mean?

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 14.0 years ago by Jan Kosinski ★ 1.6k

score 2 · Answer 1 · 2011-04-28

2

14.0 years ago

Bilouweb ★ 1.1k

From the discussion in the comments, you want to calculate a similarity score between two columns which are aligned to each other.

A column can be represented as a vector of 20 values, i.e. the count of each amino acid in the column. So you can treat the problem as calculating the distance between two vectors, which is easy to implement as a first score.
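As a minimal sketch of this idea in Python (not taken from MstatX or any other tool): the function names, the toy alignments, and the `pairs` list of aligned column indices are all made up for illustration. Two choices here are assumptions rather than part of the suggestion above: counts are normalised to frequencies so that alignments of different depth are comparable, and cosine similarity is used as the vector measure, so higher means more similar.

```python
from collections import Counter
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_frequencies(msa, col):
    """Return the 20 amino acid frequencies of one column (gaps are ignored)."""
    counts = Counter(seq[col] for seq in msa if seq[col] in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]

def cosine_similarity(p, q):
    """Cosine of the angle between two frequency vectors (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = sqrt(sum(a * a for a in p)) * sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def per_column_scores(msa_a, msa_b, column_pairs):
    """Score every aligned (column in A, column in B) pair."""
    return [
        cosine_similarity(column_frequencies(msa_a, i), column_frequencies(msa_b, j))
        for i, j in column_pairs
    ]

# Toy input: two small alignments and the column pairs that the
# profile-profile alignment puts together (hypothetical data).
msa_a = ["KVL", "KIL", "KVL"]
msa_b = ["KV", "RV"]
pairs = [(0, 0), (1, 1)]

scores = per_column_scores(msa_a, msa_b, pairs)
for (i, j), score in zip(pairs, scores):
    print(f"A[{i}] vs B[{j}]: {score:.3f}")
```

Any other vector measure (Euclidean distance, Pearson correlation, and so on) can be swapped in for `cosine_similarity` without touching the rest.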

ADD COMMENT • link 14.0 years ago by Bilouweb ★ 1.1k

0

Yes, and I think I could do it quite easily with PyCogent. I may try. It would be an extremely simple score, though; for example, it does not take amino acid similarities into account.

ADD REPLY • link 14.0 years ago by Jan Kosinski ★ 1.6k

0

To take amino acid similarities into account, you can use a substitution matrix in a vector-based measure. A good example is presented in: Scoring residue conservation, Valdar, 2002 (the score named C_Thomson, used in Clustal).
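To make the idea concrete, here is a rough sketch of one simple way to bring a substitution matrix in. Note that this is not the score from the Valdar paper; it is a plainer stand-in, the frequency-weighted average substitution score over all residue pairs between the two columns. The matrix values below are invented for a three-letter toy alphabet; in practice you would plug in a real matrix such as BLOSUM62.

```python
def expected_substitution_score(p, q, matrix):
    """Sum over residue pairs (a, b) of p[a] * q[b] * matrix[(a, b)].

    p and q map residues to their frequencies in the two aligned columns.
    """
    return sum(
        pa * qb * matrix[(a, b)]
        for a, pa in p.items()
        for b, qb in q.items()
    )

# Hypothetical toy matrix (NOT real BLOSUM values), filled in symmetrically.
_half = {
    ("K", "K"): 5, ("R", "R"): 5, ("V", "V"): 4,
    ("K", "R"): 2, ("K", "V"): -2, ("R", "V"): -3,
}
TOY_MATRIX = {}
for (a, b), s in _half.items():
    TOY_MATRIX[(a, b)] = s
    TOY_MATRIX[(b, a)] = s

lys_column = {"K": 1.0}
lys_arg_column = {"K": 0.5, "R": 0.5}
val_column = {"V": 1.0}

similar = expected_substitution_score(lys_column, lys_arg_column, TOY_MATRIX)
dissimilar = expected_substitution_score(lys_column, val_column, TOY_MATRIX)
print(similar, dissimilar)
```

With this kind of score, a Lys column aligned to a Lys/Arg column comes out positive, while a Lys column aligned to a Val column comes out negative, which a plain vector distance on counts cannot distinguish from any other mismatch.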

ADD REPLY • link 14.0 years ago by Bilouweb ★ 1.1k

0

Yes, of course. The point is that I feel like I am re-inventing the wheel: scores for comparing columns from two profiles have been improving for more than a decade, and they are quite accurate now. The equations are there, but they are quite complicated, which is why I am looking for a ready-to-use solution ;-)

ADD REPLY • link 14.0 years ago by Jan Kosinski ★ 1.6k

0

I agree... That is why I made MstatX. I found many equations for many scores, but often the source code is not available and/or no longer maintained. Now, when it looks like I would lose time searching the web and the equations are not too hard, I re-implement them myself.

ADD REPLY • link 14.0 years ago by Bilouweb ★ 1.1k

score 0 · Answer 2 · 2011-04-28

0

14.0 years ago

Will 4.6k

The MATLAB Bioinformatics Toolbox can do this easily. You can read in an alignment in virtually any format using multialignread. Then you can use seqconsensus, seqlogo or seqprofile to do almost anything you might need.

ADD COMMENT • link 14.0 years ago by Will 4.6k

0

No no, I don't want to get scores of columns of a single multiple sequence alignment. I have TWO multiple sequence alignments aligned to each other, like from typical profile/HMM-profile/HMM comparison (e.g. HHSearch). Now I need to calculate the SIMILARITY SCORES OF ALIGNING EVERY ALIGNED PAIR OF COLUMNS.

Such scoring functions are listed for instance here (Section Scoring functions of Materials and Methods): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2279992/

ADD REPLY • link 14.0 years ago by Jan Kosinski ★ 1.6k

0

If you have two multiple alignments aligned to each other, then combining them gives you one multiple alignment in which you can calculate a similarity score for each column, doesn't it?

ADD REPLY • link 14.0 years ago by Bilouweb ★ 1.1k

0

I don't think so. Let's say I have 1000 sequences in alignment A and 50 sequences in alignment B, and that a column from A with an invariant Lys residue has been aligned with a column from B with an invariant Val. The score should be low, but the score for the "combined" column would be high, because the pooled column is still about 95% Lys.
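A quick numeric illustration of this example, as a sketch only: "conservation" is taken here to be simply the frequency of the most common residue, and the pairwise score is the dot product of the two frequency vectors. Both are stand-ins chosen for brevity, not the scores used by any particular program.

```python
from collections import Counter

def frequencies(column):
    """Map each residue in a column to its relative frequency."""
    counts = Counter(column)
    return {aa: n / len(column) for aa, n in counts.items()}

col_a = "K" * 1000  # invariant Lys column from alignment A
col_b = "V" * 50    # invariant Val column from alignment B

# Pooled view: merge both columns and measure conservation.
pooled = frequencies(col_a + col_b)
pooled_conservation = max(pooled.values())

# Pairwise view: compare the two columns' frequency vectors directly.
freq_a, freq_b = frequencies(col_a), frequencies(col_b)
overlap = sum(freq_a[aa] * freq_b.get(aa, 0.0) for aa in freq_a)

print(f"pooled conservation: {pooled_conservation:.3f}")
print(f"pairwise overlap:    {overlap:.3f}")
```

The pooled column looks highly conserved because alignment A dominates it, whereas the column-versus-column comparison shows no shared residues at all.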

ADD REPLY • link 14.0 years ago by Jan Kosinski ★ 1.6k