I want to do global sequence aligment. Referring to the wikipedia article of Needleman-Wunsch for global sequence alignment. I could see the following similarity matrix
A G C T
A 10 -1 -3 -4
G -1 7 -5 -3
C -3 -5 9 0
T -4 -3 0 8
Instead of having these values I want to deal with the probability. So all the values will not be integers but real values.
In all of the matrices I had seen e.g BLOSUM50, all have the integer values. My question is that if the similarity matrix will be filled with real values. Would it work the same way.
Another thing is that, is there any special procedure to construct the similarity matrix or it could be any distance measure.
The algorithm does not exclude floating point numbers, but many implementations of the algorithm do (e.g. ggsearch36 in the FASTA package), or the older align.c program in the FASTA2 package use integer matrices, because for a long time, integer arithmetic was much faster.
There are several papers on producing similarity matrices at arbitrary evolutionary distances. The Jones, Taylor, and Thornton paper (Comp. Appl. Biosci 1992 8:275-282, Comp. Appl. Biosci. is now Bioinformatics) is widely cited and their matrices widely used. More recently, Mueller, Spang and Vingron have a series of papers, including Mol. Biol. Evol., 2002 19:8-13.
However, it is not clear to me that different scoring matrices will have much effect on global (Needleman Wunsch) sequence alignments. Since those alignments are global, they must align from beginning to end, and the only difference different scoring matrices will have will depend on the gap penalties. For local alignments, the matrices matter a lot more (see Altschul's classic paper: J. Mol. Biol. (1991) 219:555).
First of all:
Yes it is possible. There is absolutly no reason why one should not use float values.
But keep in mind:
The scores of the popular substitution-matrizes are computed using probabilities so I do not expect better results.
As stated in the comments:
An natural of combining probabilities and alignments are Pair-HMMs. With these you can set your desired probabilities and ask for the optimal alignment (viterbi-algorithm).
@peri4n thanks a lot for the answer... I only have the match probabilities not transition ones... I will dig more into Pair-HMMs, I am a newbee to HMMs... It would be very nice of you if you could please recommend standard libraries and code plus usage examples...
Thanks a lot...
Using floating point values with NW is fine; however, using probabilities is not.
If you want to work with probabilities of alignments you need to multiply their values instead of adding. The Needleman-Wunsch algorithm uses addition since it deals with the logarithm of probabilities. This is sound since log(x * y) = log x + log y.
For a good introduction into the statistical motivation of sequence alignment algorithms, see e.g. Biological Sequence Analysis by Durbin & al.
The whole reason why matrices such as BLOSUM use integral values is that they allow vastly faster computation than floating point numbers. Furthermore, these matrices are actually derived from log’ed (and scaled) probabilities.
Wouldn't you rather want to use a hidden markov model when working with probabilities?