Hello, I want to figure out whether there are genetic clusters in a time series of 93 samples. I used Mash (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x) to generate a 93x93 distance matrix that looks like the following one:
A B C D E F G H I J K L
A 0 20 20 20 40 60 60 60 100 120 120 120
B 20 0 20 20 60 80 80 80 120 140 140 140
C 20 20 0 20 60 80 80 80 120 140 140 140
D 20 20 20 0 60 80 80 80 120 140 140 140
E 40 60 60 60 0 20 20 20 60 80 80 80
F 60 80 80 80 20 0 20 20 40 60 60 60
G 60 80 80 80 20 20 0 20 60 80 80 80
H 60 80 80 80 20 20 20 0 60 80 80 80
I 100 120 120 120 60 40 60 60 0 20 20 20
J 120 140 140 140 80 60 80 80 20 0 20 20
K 120 140 140 140 80 60 80 80 20 20 0 20
L 120 140 140 140 80 60 80 80 20 20 20 0
How can I pass this matrix to a clustering algorithm? I used the kmeans function in R and got clusters, but clustering with this function is probably not a good idea when the input is a distance matrix (kmeans treats each row as the coordinates of a point and computes its own distances).
In other words, is there any clustering function in R that supports a distance matrix as input?
I was looking at this one as a feasible clustering algorithm for my dataset: https://onlinelibrary.wiley.com/doi/10.1002/9780470316801.ch3 (k-medoids for large datasets), but I don't know whether it can take a distance matrix as input.
Thanks for reading :)
You can convert a matrix of distances M into a dist object with as.dist(M). Running hclust on that object is generally a good way to start exploring cluster structure.
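Here is a minimal sketch of that workflow on the 12x12 toy matrix from the question. The average linkage and k = 3 are my assumptions for illustration, not recommendations for the real 93x93 matrix. Since the question mentions k-medoids: pam() from the cluster package also accepts a dist object, so I added it as an optional step that only runs if that package is installed.

```r
# Toy 12 x 12 distance matrix copied from the question.
labs <- LETTERS[1:12]
M <- matrix(c(
    0,  20,  20,  20,  40,  60,  60,  60, 100, 120, 120, 120,
   20,   0,  20,  20,  60,  80,  80,  80, 120, 140, 140, 140,
   20,  20,   0,  20,  60,  80,  80,  80, 120, 140, 140, 140,
   20,  20,  20,   0,  60,  80,  80,  80, 120, 140, 140, 140,
   40,  60,  60,  60,   0,  20,  20,  20,  60,  80,  80,  80,
   60,  80,  80,  80,  20,   0,  20,  20,  40,  60,  60,  60,
   60,  80,  80,  80,  20,  20,   0,  20,  60,  80,  80,  80,
   60,  80,  80,  80,  20,  20,  20,   0,  60,  80,  80,  80,
  100, 120, 120, 120,  60,  40,  60,  60,   0,  20,  20,  20,
  120, 140, 140, 140,  80,  60,  80,  80,  20,   0,  20,  20,
  120, 140, 140, 140,  80,  60,  80,  80,  20,  20,   0,  20,
  120, 140, 140, 140,  80,  60,  80,  80,  20,  20,  20,   0
), nrow = 12, byrow = TRUE, dimnames = list(labs, labs))

# as.dist() only reads the lower triangle, so check symmetry first.
stopifnot(isSymmetric(M))
d <- as.dist(M)

# Hierarchical clustering directly on the distances.
# Assumption: average linkage; try other methods as well.
hc <- hclust(d, method = "average")
if (interactive()) plot(hc)   # draw the dendrogram in an interactive session

# Cut the tree into k groups. Assumption: k = 3, chosen by eye for the toy data.
groups <- cutree(hc, k = 3)
print(split(names(groups), groups))

# Optional: k-medoids on the same dist object (needs the 'cluster' package).
if (requireNamespace("cluster", quietly = TRUE)) {
  pm <- cluster::pam(d, k = 3)
  print(pm$clustering)
}
```

For the real data, M would be the 93x93 Mash matrix read into R with the sample names as row and column names.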
Yeah, it worked! Thanks. Now I'm trying to figure out whether there is a method for estimating the proper number of clusters to look for in a dendrogram. Do you know any?
What counts as a cluster is in the eye of the beholder, so there is no single good answer to this question. It often depends on the granularity you want, e.g. do you want one cluster of all blood cells, or do you want to separate red from white blood cells?
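To make the granularity point concrete, here is a small sketch that cuts the same tree at several levels, again on the toy matrix from the question. The values of k, the cut height of 50, and the average linkage are arbitrary assumptions for illustration.

```r
# Toy 12 x 12 distance matrix copied from the question.
labs <- LETTERS[1:12]
M <- matrix(c(
    0,  20,  20,  20,  40,  60,  60,  60, 100, 120, 120, 120,
   20,   0,  20,  20,  60,  80,  80,  80, 120, 140, 140, 140,
   20,  20,   0,  20,  60,  80,  80,  80, 120, 140, 140, 140,
   20,  20,  20,   0,  60,  80,  80,  80, 120, 140, 140, 140,
   40,  60,  60,  60,   0,  20,  20,  20,  60,  80,  80,  80,
   60,  80,  80,  80,  20,   0,  20,  20,  40,  60,  60,  60,
   60,  80,  80,  80,  20,  20,   0,  20,  60,  80,  80,  80,
   60,  80,  80,  80,  20,  20,  20,   0,  60,  80,  80,  80,
  100, 120, 120, 120,  60,  40,  60,  60,   0,  20,  20,  20,
  120, 140, 140, 140,  80,  60,  80,  80,  20,   0,  20,  20,
  120, 140, 140, 140,  80,  60,  80,  80,  20,  20,   0,  20,
  120, 140, 140, 140,  80,  60,  80,  80,  20,  20,  20,   0
), nrow = 12, byrow = TRUE, dimnames = list(labs, labs))

hc <- hclust(as.dist(M), method = "average")  # assumption: average linkage

# One column of cluster labels per requested k: coarse to fine.
cuts <- cutree(hc, k = 2:4)
print(cuts)

# Alternatively, cut at a fixed height on the dendrogram's y axis.
# Assumption: 50 is an arbitrary height picked for the toy data.
by_height <- cutree(hc, h = 50)
print(table(by_height))
```

Each column of cuts is a valid clustering of the same tree; which one is "right" depends on the question you are asking of the data.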
You can try the dynamicTreeCut package to find clusters in a dendrogram. If no obvious structure is visible in the dendrogram, you may want to explore the underlying feature space a bit more, for example with dimensionality reduction methods: where do the samples fall when plotted on the first two principal components? Does UMAP/t-SNE reveal any meaningful clustering? Note that you can also do the clustering in the reduced-dimensionality space.
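A rough sketch of both suggestions on the toy matrix from the question follows. Two caveats. First, since only distances are available here, I swapped in classical multidimensional scaling (cmdscale, also known as PCoA) for PCA; it produces low-dimensional coordinates from a distance matrix. Second, the dynamicTreeCut step only runs if that package is installed, and minClusterSize = 2 is an assumption to suit the tiny example (the package default is much larger), so I make no claim about what it returns on data this small.

```r
# Toy 12 x 12 distance matrix copied from the question.
labs <- LETTERS[1:12]
M <- matrix(c(
    0,  20,  20,  20,  40,  60,  60,  60, 100, 120, 120, 120,
   20,   0,  20,  20,  60,  80,  80,  80, 120, 140, 140, 140,
   20,  20,   0,  20,  60,  80,  80,  80, 120, 140, 140, 140,
   20,  20,  20,   0,  60,  80,  80,  80, 120, 140, 140, 140,
   40,  60,  60,  60,   0,  20,  20,  20,  60,  80,  80,  80,
   60,  80,  80,  80,  20,   0,  20,  20,  40,  60,  60,  60,
   60,  80,  80,  80,  20,  20,   0,  20,  60,  80,  80,  80,
   60,  80,  80,  80,  20,  20,  20,   0,  60,  80,  80,  80,
  100, 120, 120, 120,  60,  40,  60,  60,   0,  20,  20,  20,
  120, 140, 140, 140,  80,  60,  80,  80,  20,   0,  20,  20,
  120, 140, 140, 140,  80,  60,  80,  80,  20,  20,   0,  20,
  120, 140, 140, 140,  80,  60,  80,  80,  20,  20,  20,   0
), nrow = 12, byrow = TRUE, dimnames = list(labs, labs))

d  <- as.dist(M)
hc <- hclust(d, method = "average")  # assumption: average linkage

# Optional: let dynamicTreeCut pick branches instead of a fixed k or height.
# Assumption: minClusterSize = 2 because the toy data has only 12 samples.
if (requireNamespace("dynamicTreeCut", quietly = TRUE)) {
  dyn <- dynamicTreeCut::cutreeDynamic(dendro = hc, distM = as.matrix(d),
                                       minClusterSize = 2, verbose = 0)
  print(table(dyn))  # label 0, if present, means "unassigned"
}

# Classical MDS: 2-D coordinates whose Euclidean distances approximate d.
coords <- cmdscale(d, k = 2)
if (interactive()) {
  plot(coords, type = "n", xlab = "MDS 1", ylab = "MDS 2")
  text(coords, labels = rownames(coords))
}

# Clustering in the reduced space: kmeans is fine here because its input
# is now a table of coordinates rather than a distance matrix.
set.seed(1)                                     # reproducible random starts
km <- kmeans(coords, centers = 3, nstart = 25)  # assumption: 3 centers
print(km$cluster)
```

UMAP and t-SNE live in add-on packages, so I left them out of the sketch.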