"Representative sequences from a cluster", this notion seems to be adressing the medoid of a cluster. A medoid is similar to a centroid but in contrast to a centroid it is always a member of the cluster, while a centroid normally is not member of the data-set. A cluster medoid can be computed from the distance matrix alone, it is the cluster member with minimum average pairwise distance to all other members.
A good definition is here: http://www.unesco.org/webworld/idams/advguide/Chapt7_1_1.htm
Don't be confused: you do not need to run a k-medoids clustering. And there is of course only one medoid per cluster (ignoring ties in distances).
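To make the centroid/medoid difference concrete, here is a minimal sketch on made-up one-dimensional data (the values and the labels a to e are only for illustration, they are not from your sequences):

```r
# Toy data: five labelled points on a line (made-up values)
x <- c(a = 0, b = 1, c = 2, d = 5, e = 12)

# Centroid: the mean, which here is not one of the data points
centroid <- mean(x)                      # 4

# Medoid: the member with the smallest summed distance to all others
d <- as.matrix(dist(x))                  # full symmetric distance matrix
medoid <- names(which.min(rowSums(d)))   # "c"

centroid
medoid
```

The centroid (4) is not in the data, while the medoid ("c", value 2) is an actual member.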
Edit to address your comment:
There is no package needed for this. The steps in R are the following:
For each cluster:
- get the list of objects clustered in cluster C_i
- get the distance matrix D and restrict it to the objects in C_i
- convert D into a full matrix (in R, a dist object stores only the lower triangle)
- find the row/column with the minimal rowsum/colsum; that is the medoid of C_i,
that is, the object with the minimum overall distance to all other objects in the cluster.
In the following example I am using the USArrests sample dataset, so the objects are named after US states instead of your sequences; don't worry about that.
A simple one-line example:
> which.min(rowSums(as.matrix(dist(USArrests))))
Virginia
46
shows that "Virginia" (at index 46 in the data) is the medoid of the whole data set.
A bit more complicated example, for all the clusters:
mydist = dist(USArrests)
clusters = cutree(hclust(mydist), k=5) # get 5 clusters
mydist = as.matrix(mydist) # get a full matrix
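# function to find the medoid of cluster i
clust.medoid = function(i, distmat, clusters) {
  ind = (clusters == i)
  # drop = FALSE keeps a matrix even when the cluster has a single member
  names(which.min(rowSums(distmat[ind, ind, drop = FALSE])))
  # alternative, returns the smallest mean distance instead of the name:
  # c(min(rowMeans(distmat[ind, ind, drop = FALSE])))
}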
sapply(unique(clusters), clust.medoid, mydist, clusters)
[1] "Michigan" "Missouri" "Kansas" "Florida" "New Hampshire"
This shows that Michigan, ..., New Hampshire are the medoids of the 5 clusters.
Bear in mind that this is probably the most accurate way of getting a representative sequence which is contained in the data; however, the consensus sequence of a multiple sequence alignment might be a more accurate representative of all sequences, though it is not a member of the data itself.
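For comparison, a simple majority-rule consensus could be sketched in base R as below. This is only an illustration: the four aligned sequences are made up, and a real MSA tool would also handle gaps, ambiguity codes and ties more carefully.

```r
# Toy alignment: made-up sequences of equal length, assumed already aligned
aln <- c(s1 = "TCGTAC", s2 = "ACGTTC", s3 = "ACCTAC", s4 = "ACGTAG")

# One row per sequence, one column per alignment position
mat <- do.call(rbind, strsplit(aln, ""))

# Most frequent character in a column (ties go to the alphabetically first one)
majority <- function(column) names(which.max(table(column)))

consensus <- paste(apply(mat, 2, majority), collapse = "")
consensus            # "ACGTAC"
consensus %in% aln   # FALSE: the consensus is not one of the input sequences
```

Here every input sequence differs from the consensus at one position, so the consensus is a reasonable representative without being a member of the data.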
The best approach really depends on what the sequence clusters represent and what you are trying to accomplish. If you are just clustering to reduce sequence space and select representative sequences (like say Uniprot clusters) that is very different than if you are clustering on protein families...
btw, why don't you use a standard MSA approach?
Regarding your error: I think you have a cluster with only one member, and in that case the function fails. I will edit your code so it should work; have a look at the edits.
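A minimal sketch of why a single-member cluster is a problem, using the first three rows of USArrests and a hand-made logical index (both chosen only for illustration): subsetting a matrix down to one row and one column returns a plain number unless drop = FALSE is given, and rowSums() needs at least two dimensions.

```r
d <- as.matrix(dist(USArrests[1:3, ]))
ind <- c(TRUE, FALSE, FALSE)   # pretend the cluster contains only the first object

# Without drop = FALSE the subset collapses to a scalar and rowSums() errors
failed <- inherits(try(rowSums(d[ind, ind]), silent = TRUE), "try-error")

# With drop = FALSE the subset stays a 1 x 1 matrix and keeps its row name
kept <- d[ind, ind, drop = FALSE]
medoid <- names(which.min(rowSums(kept)))

failed   # TRUE
medoid   # "Alabama"
```

A cluster with one member is trivially its own medoid, which is what the drop = FALSE version returns.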
@Michael. Your code worked well. Thank you very much.