Hi everyone !
I am working on a K-medoid algorithm, but I have trouble understanding how unsupervised learning works, your help will be more than welcome.
Each index is an organism. So if length of the matrix is 5 x 5, then there are 5 organisms to be compared (in my real dataset I have a lot more ^^), and index 0 means first organism and so on.
The distance matrix represents the distance of Manhattan calculated between two organisms, from the gene function present or absent (COG).
Here is my code (which works so far), I couldn't do the last part - the iteration (this is just a start) - because I can't figure out when my loop should stop : should I put a threshold myself, like the number of iteration maximum ?
Is there an automatic way to do that ?
What do you think of this option (line while not ((old_medoids == curr_medoids).all()):
) ? I think it's too ramdom, just expecting that we get twice the same group of medoids...
And how does the computer "learn" not to go back the points that doesn't optimise the algorithm, and keep the "profitable" points ? Should I create a list of elements "already seen - not to be done" ?
And last optional question : I need to create a function to optimize k (k being the number of clusters chosen, by the user or by the computer). I do not have a clue how I could do that.
If you have any idea in any of those points or comments about my code, thank you in advance :)
import random as rd
import numpy as np
# DATA = distances matrix
# matrix [2][1] => distance between point 2 and point 1
# matrix is symetric => matrix [2][1] is the same as [1][2] but we only use [2][1]
distances = np. array([[ 0. , 0. , 0. , 0. , 0. ],
[ 0.64745477, 0. , 0. , 0. , 0. ],
[ 0.36053849, 0.57299117, 0. , 0. , 0. ],
[ 0.3580143 , 0.59402608, 0.18258309, 0. , 0. ],
[ 0.35128313, 0.615061 , 0.23727387, 0.24821203, 0. ]])
def initialize_medoids(distances, k=2):
'''
Take the distances matrix, send back a list of k random elements => called medoids
=> example : [4, 2]
'''
first_medoids = rd.sample([i for i in range(len(distances))], k)
return first_medoids
def associate_point_to_medoids(current_medoids, distances):
'''
Each point (not in the medoid list) is associated with his closest medoid
=> result in a list of lists : [[medoid1, his closest medoids], [medoid2, his closest medoids]]
=> example : [[4, 1], [2, 3, 0]]
'''
configuration = [[x] for x in current_medoids]
for i in range(len(distances)):
if i not in current_medoids:
d = 1
associate = -1
for j in range(len(current_medoids)):
if i > current_medoids[j] :
if d > distances[i][j]:
d = distances[i][j]
associate = j
if i < current_medoids[j] :
if d > distances[j][i]:
d = distances[j][i]
associate = j
configuration[associate].append(i)
return configuration
def calculate_cost(configuration, distances):
'''
Sum the distances between points and their respective medoids.
=> take this type of argument : [[4, 1], [2, 3, 0]]
'''
summ = 0
for elt in configuration :
for i in range(1, len(elt)):
if elt[0] > elt[i] :
summ += distances[elt[0]][elt[i]]
else :
summ += distances[elt[i]][elt[0]]
return summ
def switch_medoids(n, current_medoids):
'''
n : define which medoids we're going to change
=> send back new_medoids
'''
new_medoids = list(current_medoids)
index = new_medoids.index(n)
new_medoids[index] = rd.choice([i for i in range(len(distances)) if i not in current_medoids])
return new_medoids
first_medoids = initialize_medoids(distances)
clusters = associate_point_to_medoids(first_medoids, distances)
old_cost = calculate_cost(clusters, distances)
# iteration manual...
n = first_medoids[1]
new_medoids = switch_medoids(n, first_medoids)
new_clusters = associate_point_to_medoids(new_medoids, distances)
new_cost = calculate_cost(new_clusters, distances)
if new_cost > old_cost :
# keep the old medoid => change only the index
n = first_medoids[0]
print 'no, the old configuration was good'
#reiterate
else :
# keep the new medoid => we can keep the same index or change it
n = new_medoids[1]
print 'yes, keep this one'
# reiterate
Here are a sample of the kind of results that can be obtained :
first_medoids
Out[358]: [3, 4]
clusters
Out[359]: [[3, 0, 1, 2], [4]]
old_cost
Out[360]: 1.13462347
new_medoids
Out[361]: [3, 1]
new_clusters
Out[362]: [[3, 0, 2, 4], [1]]
new_cost
Out[363]: 0.78880942000000009