Similarity matrix to distance matrix for protein sequences

0

4.7 years ago

kbaitsi • 0

I have used R to calculate a similarity matrix for 11 proteins (histones) from a FASTA file. I now need to turn the similarity matrix into a distance matrix so that I can use it in hclust. I have tried sim2dist and also dist with each of its methods (euclidean, maximum, manhattan, canberra, binary, minkowski). I have ruled out the binary method, but I am not sure which of the remaining options is the best way to calculate the distance. Any thoughts?

similarity distance protein sequences r • 3.7k views

ADD COMMENT • link 4.7 years ago by kbaitsi • 0

1

There are a few common and generic ways of turning a similarity into a distance such as:

d = max(s) - s (e.g. if similarity is cosine then max(s) = 1)
d = 1/(s+1)
d = exp(-s^a), with a being a parameter

In fact, any function that is strictly decreasing will do.
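
As a minimal sketch in base R, the three conversions could be applied to a small made-up similarity matrix as follows. The matrix `s`, the labels H1 to H4, the choice `a = 1`, and the average linkage are all assumptions for illustration. Note the use of `as.dist()`, which takes the matrix entries themselves as distances, whereas `dist()` would compute new distances between the rows of the matrix.

```r
# Hypothetical symmetric similarity matrix for four sequences (made-up scores).
s <- matrix(c(10,  8,  2,  1,
               8, 10,  3,  2,
               2,  3, 10,  7,
               1,  2,  7, 10),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("H", 1:4), paste0("H", 1:4)))

# Three strictly decreasing transformations of the similarity.
d_linear <- max(s) - s        # d = max(s) - s
d_recip  <- 1 / (s + 1)       # d = 1/(s+1)
a        <- 1                 # assumed value of the parameter a
d_exp    <- exp(-s^a)         # d = exp(-s^a)

# as.dist() reads the lower triangle of the matrix as pairwise distances.
hc     <- hclust(as.dist(d_linear), method = "average")
groups <- cutree(hc, k = 2)   # cut the tree into two clusters
```

The reciprocal and exponential forms do not give a zero self-distance, but `as.dist()` only reads the lower triangle, so the diagonal is ignored. If the similarity scores can be negative, as raw alignment scores can, `1/(s+1)` and `s^a` with a non-integer `a` may misbehave, so shifting the scores to be non-negative first is one option.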

ADD REPLY • link 4.7 years ago by Jean-Karim Heriche 27k

0

Thank you for your answer. sim2dist does what you wrote in the first bullet. I was just wondering whether there is a preferable way when it comes to protein sequences, or whether it doesn't matter.

ADD REPLY • link 4.7 years ago by kbaitsi • 0

3

What matters most is the choice of the original measure of similarity. It has to capture the notion of proximity/similarity that is relevant to the question you're trying to address. When converting you need to make sure that distribution properties that are important for the clustering are preserved.
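
One way to look at what a conversion preserves is to compare two of them on the same data. The sketch below, in base R with a made-up similarity matrix `s` (all names and values are hypothetical), checks that two strictly decreasing conversions keep the same rank order of pairwise distances while changing their spacing. It uses complete linkage as an assumed example of a method that depends only on that rank order.

```r
# Made-up similarity scores, for illustration only.
s <- matrix(c(10,  8,  2,  1,
               8, 10,  3,  2,
               2,  3, 10,  7,
               1,  2,  7, 10),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("H", 1:4), paste0("H", 1:4)))

d_linear <- as.dist(max(s) - s)     # d = max(s) - s
d_recip  <- as.dist(1 / (s + 1))    # d = 1/(s+1)

# Both conversions are strictly decreasing in s, so the rank order
# of the pairwise distances is the same.
same_ranks <- all(rank(as.vector(d_linear)) == rank(as.vector(d_recip)))

# Complete linkage on each converted matrix.
hc_linear <- hclust(d_linear, method = "complete")
hc_recip  <- hclust(d_recip,  method = "complete")

# The merge order agrees, but the merge heights (spacing) do not.
same_merges  <- identical(hc_linear$merge, hc_recip$merge)
same_heights <- isTRUE(all.equal(hc_linear$height, hc_recip$height))
```

Single and complete linkage depend only on the ordering of the distances, so they give the same tree topology under any strictly decreasing conversion. Methods that average or square distances, such as average linkage or Ward's method, can be sensitive to the spacing, so the choice of conversion may matter more there.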

ADD REPLY • link 4.7 years ago by Jean-Karim Heriche 27k

0

I have used the pairwise alignment function and a BLOSUM substitution matrix. Thanks a lot for your time and answer.

ADD REPLY • link 4.7 years ago by kbaitsi • 0
