Does Needlman-Wunsch Algorithm Works With Float Values In Similarity Matrix?
3
3
Entering edit mode
12.7 years ago
Shan ▴ 50

Hi

I want to do global sequence aligment. Referring to the wikipedia article of Needleman-Wunsch for global sequence alignment. I could see the following similarity matrix

            A       G        C        T
   A       10      -1       -3       -4
   G       -1       7       -5       -3
   C       -3      -5        9        0
   T       -4      -3        0        8

Instead of having these values I want to deal with the probability. So all the values will not be integers but real values.

In all of the matrices I had seen e.g BLOSUM50, all have the integer values. My question is that if the similarity matrix will be filled with real values. Would it work the same way.

Another thing is that, is there any special procedure to construct the similarity matrix or it could be any distance measure.

Thanks a lot.

sequence alignment similarity • 3.6k views
ADD COMMENT
0
Entering edit mode

Wouldn't you rather want to use a hidden markov model when working with probabilities?

ADD REPLY
10
Entering edit mode
12.7 years ago
Bill Pearson ★ 1.0k

The algorithm does not exclude floating point numbers, but many implementations of the algorithm do (e.g. ggsearch36 in the FASTA package), or the older align.c program in the FASTA2 package use integer matrices, because for a long time, integer arithmetic was much faster.

There are several papers on producing similarity matrices at arbitrary evolutionary distances. The Jones, Taylor, and Thornton paper (Comp. Appl. Biosci 1992 8:275-282, Comp. Appl. Biosci. is now Bioinformatics) is widely cited and their matrices widely used. More recently, Mueller, Spang and Vingron have a series of papers, including Mol. Biol. Evol., 2002 19:8-13.

However, it is not clear to me that different scoring matrices will have much effect on global (Needleman Wunsch) sequence alignments. Since those alignments are global, they must align from beginning to end, and the only difference different scoring matrices will have will depend on the gap penalties. For local alignments, the matrices matter a lot more (see Altschul's classic paper: J. Mol. Biol. (1991) 219:555).

ADD COMMENT
2
Entering edit mode
12.7 years ago
Fabian Bull ★ 1.3k

First of all: Yes it is possible. There is absolutly no reason why one should not use float values.

But keep in mind: The scores of the popular substitution-matrizes are computed using probabilities so I do not expect better results.

As stated in the comments: An natural of combining probabilities and alignments are Pair-HMMs. With these you can set your desired probabilities and ask for the optimal alignment (viterbi-algorithm).

ADD COMMENT
0
Entering edit mode

@peri4n thanks a lot for the answer... I only have the match probabilities not transition ones... I will dig more into Pair-HMMs, I am a newbee to HMMs... It would be very nice of you if you could please recommend standard libraries and code plus usage examples... Thanks a lot...

ADD REPLY
0
Entering edit mode
12.7 years ago
Konrad ▴ 80

Using floating point values with NW is fine; however, using probabilities is not.

If you want to work with probabilities of alignments you need to multiply their values instead of adding. The Needleman-Wunsch algorithm uses addition since it deals with the logarithm of probabilities. This is sound since log(x * y) = log x + log y.

For a good introduction into the statistical motivation of sequence alignment algorithms, see e.g. Biological Sequence Analysis by Durbin & al.

The whole reason why matrices such as BLOSUM use integral values is that they allow vastly faster computation than floating point numbers. Furthermore, these matrices are actually derived from log’ed (and scaled) probabilities.

ADD COMMENT

Login before adding your answer.

Traffic: 2169 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6