Clustering using codon usage similarity
8.1 years ago
Saad Khan ▴ 440

Hi,

I have a codon usage similarity matrix that I obtained elsewhere. Most clustering algorithms start from data shaped like the iris dataset: n rows (observations) and x columns (features). Most R packages do not start from a distance matrix directly; instead they apply their own distance function ("euclidean", "minkowski", etc.) to the raw data. Since I am starting directly from a distance matrix, I would like some insight into how to cluster the matrix in the first place, and then how to determine the optimal number of clusters from the data. Almost all of the methods (R packages) described here (http://stackoverflow.com/questions/15376075/cluster-analysis-in-r-determine-the-optimal-number-of-clusters) do not accept a distance matrix as input. R packages like dbscan (http://www.sthda.com/english/wiki/dbscan-density-based-clustering-for-discovering-clusters-in-large-datasets-with-noise-unsupervised-machine-learning) do accept one, but then you have the problem of defining "eps: Reachability maximum distance" and "MinPts: Reachability minimum number of points" beforehand. If anyone has dealt with a similar issue, could you share examples and/or a workaround?

codon usage clustering • 2.5k views

Hi, I am not sure I understand correctly, but if you already have your distances (the similarity matrix) and want to cluster directly on them (instead of calculating Euclidean distances), I think you can use as.dist.

e.g.,

# 'mat' is your square, symmetric distance matrix
HC <- hclust(as.dist(mat))
plot(HC)
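
To also get at the "optimal number of clusters" part of the question, one common heuristic is to cut the tree at several values of k and keep the k with the highest average silhouette width. The sketch below is only an illustration: the toy points, the names `dist_mat`, `ks` and `best_k`, the "average" linkage, and the candidate range 2:6 are all assumptions, not something from the original post. `silhouette()` comes from the `cluster` package and accepts a dist object, so no feature matrix is needed.

```r
library(cluster)  # for silhouette()

# Toy stand-in for a precomputed distance matrix: three well-separated groups.
set.seed(1)
pts <- rbind(matrix(rnorm(20, mean = 0,  sd = 0.3), ncol = 2),
             matrix(rnorm(20, mean = 5,  sd = 0.3), ncol = 2),
             matrix(rnorm(20, mean = 10, sd = 0.3), ncol = 2))
dist_mat <- as.matrix(dist(pts))  # square, symmetric, zero diagonal

d  <- as.dist(dist_mat)              # coerce the matrix to a dist object
hc <- hclust(d, method = "average")  # linkage choice is an assumption

# Average silhouette width for each candidate number of clusters
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  cl <- cutree(hc, k = k)
  mean(silhouette(cl, d)[, "sil_width"])
})

best_k   <- ks[which.max(avg_sil)]
clusters <- cutree(hc, k = best_k)   # cluster label for each observation
```

With your own data you would replace `dist_mat` with your matrix and widen or narrow `ks` as appropriate; plotting `avg_sil` against `ks` is a quick sanity check on how clear the maximum is.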

Note that as.dist only coerces the matrix into a dist object. The content doesn't 'magically' become interpretable as a distance. If you have a matrix of similarities, you first need to convert it to distances (i.e. dissimilarities). Using a similarity matrix where a distance matrix is expected will usually produce the wrong result, because a high distance value means a low similarity and vice versa. There are various ways of converting a similarity into a distance; one simple option is D(i,j) = max(S) - S(i,j).
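
As a minimal sketch of that conversion (the 4x4 matrix `S` and the labels g1..g4 are made-up values for illustration only), the D(i,j) = max(S) - S(i,j) rule can be applied before calling as.dist:

```r
# Hypothetical similarity matrix (higher = more similar); values are invented
S <- matrix(c(1.0, 0.9, 0.2, 0.1,
              0.9, 1.0, 0.3, 0.2,
              0.2, 0.3, 1.0, 0.8,
              0.1, 0.2, 0.8, 1.0),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("g", 1:4), paste0("g", 1:4)))

D <- max(S) - S   # D(i,j) = max(S) - S(i,j); the diagonal becomes 0 here
d <- as.dist(D)   # now a genuine dissimilarity object

hc     <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)   # g1 and g2 fall together, as do g3 and g4
```

If the similarities are bounded in [0, 1], `1 - S` is an equivalent choice; which transformation is most appropriate depends on how the similarity was defined.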


How about other, more robust methods (R packages) like K-means, PAM (K-medoids), mclust, etc.?
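
For what it's worth, of these only PAM works from distances directly: `pam()` in the `cluster` package accepts a dist object, whereas `kmeans()` and `Mclust()` expect a feature matrix. A possible workaround for the latter is to embed the distances into coordinates first with classical multidimensional scaling (`cmdscale()`). The following is a rough sketch under those assumptions; the toy data, the candidate range 2:5, the two embedding dimensions, and all variable names are hypothetical.

```r
library(cluster)  # for pam()

# Toy stand-in for a precomputed distance object: two well-separated groups
set.seed(42)
pts <- rbind(matrix(rnorm(30, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(30, mean = 4, sd = 0.3), ncol = 2))
d <- dist(pts)

# PAM takes the dist object as-is; pick k by average silhouette width
ks <- 2:5
avg_width <- sapply(ks, function(k) pam(d, k = k, diss = TRUE)$silinfo$avg.width)
best_k <- ks[which.max(avg_width)]
fit    <- pam(d, k = best_k, diss = TRUE)

# k-means needs coordinates: embed the distances with classical MDS first
coords <- cmdscale(d, k = 2)   # 2 dimensions is an arbitrary choice
km     <- kmeans(coords, centers = best_k, nstart = 10)
```

Note that the MDS embedding only approximates the original distances when they are not Euclidean, so results from `kmeans()` on `coords` may differ from those of `pam()` on `d`.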


I confirm the first comment, but we should respect the distance object format.
