I have sequence data in the form of a matrix as follows.
matrix=
[
[78 65 78 ... 84 65 65]
[78 71 78 ... 71 71 71]
[78 67 78 ... 84 65 84]
...
[65 65 65 ... 65 65 71]
[67 67 71 ... 84 65 65]
[65 71 65 ... 65 65 84]
]
The shape of this matrix is (105772, 151). My goal is to group the rows into two clusters using the Biopython library (http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc230). To create clusters, I need some similarity measure between the reads in my data. Biopython provides a function that computes the distances between reads, here using the Euclidean distance:
from Bio.Cluster import distancematrix
distances = distancematrix(matrix, dist='e')
Here "e" stands for the Euclidean distance. For my data of shape (105772, 151), computing the distances took around 18 minutes.
After that, I tried to create clusters using k-means with the following Python code.
from Bio.Cluster import kcluster
clusterid, error, nfound = kcluster(matrix, dist='e')
It took less than a minute to run the above command.
I am wondering why the kcluster command from Biopython takes so little time, whereas the distance computation takes much longer. To create clusters, I would expect the distance matrix to be computed first, with all other calculations based on it. So does kcluster compute the distances internally by default?
Or maybe my understanding of kcluster and distancematrix in Biopython is wrong?
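To make the comparison concrete, here is a rough sketch of how many distance evaluations each call might need. It rests on assumptions I have not confirmed from the Biopython source: that distancematrix evaluates every pair of rows, and that k-means only compares each row with each cluster centroid once per iteration. The helper names and the iteration count of 100 are hypothetical and chosen only for illustration.

```python
# Back-of-the-envelope count of distance evaluations.
# Helper names and the iteration count are hypothetical, not Biopython API.


def pairwise_distance_count(n_rows):
    """Number of unordered row pairs, i.e. entries of a lower-triangular distance matrix."""
    return n_rows * (n_rows - 1) // 2


def kmeans_distance_count(n_rows, n_clusters, n_iterations):
    """Row-to-centroid comparisons, assuming one per row, per centroid, per iteration."""
    return n_rows * n_clusters * n_iterations


n_rows = 105772      # number of reads, from the matrix shape above
n_clusters = 2       # two clusters, as in my goal
n_iterations = 100   # assumed iteration count, for illustration only

pairs = pairwise_distance_count(n_rows)
centroid = kmeans_distance_count(n_rows, n_clusters, n_iterations)

print("all-pairs distance evaluations:", pairs)
print("row-to-centroid evaluations:   ", centroid)
print("ratio:", pairs / centroid)
```

If these assumptions hold, the all-pairs computation does far more work than the row-to-centroid comparisons, which might explain the timing gap, but I would like confirmation that this is what kcluster actually does.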