Hi guys,
I used Clustal Omega to compute a pairwise distance matrix for 500 protein sequences (all homologous to each other).
I want to use affinity propagation to cluster these sequences.
Initially, because I saw by inspection that the distance matrix only contains values between 0 and 1, with a distance of 0 meaning 100% identity, I reasoned that I could just take (1 - distance) as the affinity.
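To make that concrete, here is a minimal sketch of the (1 - distance) approach. It is not my exact code: the toy matrix is made-up data standing in for the real Clustal Omega output, the variable names are just for illustration, and it assumes scikit-learn's AffinityPropagation with affinity="precomputed".

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Toy stand-in for the Clustal Omega output (hypothetical data): a symmetric
# distance matrix with values in [0, 1] and zeros on the diagonal.
rng = np.random.default_rng(0)
points = np.vstack(
    [rng.normal(loc=c, scale=0.1, size=(10, 2)) for c in (0.0, 1.0, 2.0)]
)
diff = points[:, None, :] - points[None, :, :]
distance = np.sqrt((diff ** 2).sum(axis=-1))
distance /= distance.max()  # rescale so all distances lie in [0, 1]

# The conversion described above: distance 0 (identical) becomes similarity 1.
similarity = 1.0 - distance

# affinity="precomputed" makes scikit-learn treat the input as similarities
# instead of computing them from feature vectors.
model = AffinityPropagation(affinity="precomputed", random_state=0)
labels = model.fit_predict(similarity)

print(labels)
```

Larger values in the matrix mean "more similar", which as far as I understand is all that affinity propagation requires of its input.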
I ran my code, and the clusters looked reasonable, and I thought all was well... until I read that typically, affinity matrices are calculated from distance matrices by applying a "heat kernel". That's when all hell broke loose in my mind.
Did I misunderstand the concept of an affinity matrix? Is there an easy way of computing the affinity matrix from a distance matrix? scikit-learn offers the following formula:
similarity = np.exp(-beta * distance / distance.std())
But what is beta, and what is distance.std()?
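As far as I can tell, distance.std() is just NumPy's standard deviation taken over all entries of the matrix, and beta looks like a free scaling parameter that controls how quickly similarity falls off with distance. Here is a minimal sketch under those assumptions; the heat_kernel name and the 3x3 matrix are made up for illustration.

```python
import numpy as np

def heat_kernel(distance, beta=1.0):
    """Turn a distance matrix into similarities with the formula quoted above.

    beta is assumed to be a user-chosen positive constant; larger values
    make the similarity decay faster as distance grows.
    """
    scale = distance.std()  # standard deviation over all matrix entries
    return np.exp(-beta * distance / scale)

# Hypothetical 3x3 distance matrix: symmetric, zeros on the diagonal.
distance = np.array([
    [0.0, 0.2, 0.8],
    [0.2, 0.0, 0.6],
    [0.8, 0.6, 0.0],
])

soft = heat_kernel(distance, beta=1.0)
sharp = heat_kernel(distance, beta=5.0)

print(soft.round(3))
print(sharp.round(3))
```

Like (1 - distance), this maps a distance of 0 to a similarity of 1, but distant pairs decay toward 0 exponentially rather than linearly.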
I'm quite confused and lost right now with the concepts involved (as opposed to the actual coding implementation), so any help is greatly appreciated!