I am dealing with a similar task. Here are my findings, which may be of help to somebody else.
- Somatic mutations are said to be sparse and heterogeneous, so using them for clustering is not a straightforward task. Before jumping to clustering methods, there are suggestions on how to de-sparsify your data. For instance, knowledge of gene-gene networks is often used for data de-sparsification. A detailed discussion can be found here. I am not covering de-sparsification methods in this answer.
- Clustering categorical data is tricky, because it can lead to nonsensical and wrong conclusions.
- In contrast to classic clustering, your matrix here is not numerical, so you should not use common algorithms like k-means clustering. If you do apply one, it will not complain about your data type and will still give you a result!
- A mutational matrix is binary or categorical.
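To illustrate the k-means caveat above, here is a minimal sketch on a made-up 0/1 matrix (the data and object names are hypothetical). kmeans() accepts the matrix without any warning, and the cluster centres it reports are fractional values that no sample can actually take:

```r
# Toy binary mutation matrix: 6 samples (rows) x 4 genes (columns); values are made up
toy <- matrix(c(1, 1, 0, 0,
                1, 1, 0, 0,
                1, 0, 0, 0,
                0, 0, 1, 1,
                0, 0, 1, 1,
                0, 1, 1, 1),
              nrow = 6, byrow = TRUE,
              dimnames = list(paste0("S", 1:6), paste0("gene", 1:4)))

set.seed(1)                      # k-means starts from randomly chosen centres
km <- kmeans(toy, centers = 2)   # runs without complaining about the data type

# Centres are per-cluster column means, so they are generally fractions
# rather than valid 0/1 mutation states
km$centers
```

That the call succeeds says nothing about whether Euclidean means are a sensible summary of mutation calls, which is the point of the warning.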
These are the steps needed for clustering:
Dissimilarity matrix calculation
In the first step you should calculate a dissimilarity matrix for clustering. Again, arithmetic on categorical/binary data is difficult. For this you can use the Gower distance, which is available through the daisy function of the cluster R package (a recommended package that ships with standard R installations).
There are also methods in the vegan R package that are appropriate for binary data: binomial, raup and jaccard. Which method to choose depends on your data and your own judgment.
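As a small illustration of what a binary dissimilarity measures, the sketch below computes the Jaccard dissimilarity by hand for two made-up samples and compares it with base R's dist(method = "binary"), which uses the same definition. The vectors and the helper name jaccard_dissim are hypothetical, for illustration only.

```r
# Two made-up samples over 5 genes (1 = mutated, 0 = not mutated)
s1 <- c(1, 1, 0, 0, 1)
s2 <- c(1, 0, 0, 1, 1)

# Jaccard dissimilarity: among genes mutated in at least one sample,
# the fraction mutated in only one of them (shared zeros are ignored)
jaccard_dissim <- function(a, b) {
  either <- sum(a > 0 | b > 0)
  both   <- sum(a > 0 & b > 0)
  if (either == 0) return(0)   # assumption: two all-zero samples count as identical
  1 - both / either
}

by_hand <- jaccard_dissim(s1, s2)

# Base R's "binary" distance applies the same definition to the rows of a matrix
by_dist <- as.numeric(dist(rbind(s1, s2), method = "binary"))

c(by_hand = by_hand, by_dist = by_dist)
```

Ignoring shared zeros matters for sparse mutation data: two samples should not look similar merely because most genes are unmutated in both.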
Choosing Clustering algorithms
Choosing the clustering algorithm is the next step. For categorical data you would go for hierarchical clustering (either the agglomerative or the divisive approach). The final step is assessing the clustering result.
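One simple way to assess a hierarchical clustering is the cophenetic correlation, i.e. how well the dendrogram preserves the original dissimilarities. The following is only a sketch on made-up binary data using base R functions (dist, hclust, cophenetic, cor); the toy matrix and the linkage methods compared are assumptions for illustration, not a recommendation.

```r
# Made-up binary matrix: 8 samples (rows) x 6 genes (columns)
toy <- matrix(c(1, 1, 1, 0, 0, 0,
                1, 1, 0, 0, 0, 0,
                1, 0, 1, 0, 0, 0,
                0, 1, 1, 0, 0, 0,
                0, 0, 0, 1, 1, 1,
                0, 0, 0, 1, 1, 0,
                0, 0, 0, 1, 0, 1,
                0, 0, 1, 0, 1, 1),
              nrow = 8, byrow = TRUE,
              dimnames = list(paste0("S", 1:8), paste0("gene", 1:6)))

# Base R "binary" (Jaccard-type) dissimilarity between samples
d <- dist(toy, method = "binary")

# For each linkage method, correlate the original dissimilarities with the
# cophenetic distances implied by the resulting dendrogram
linkages <- c("complete", "average", "single")
coph_cor <- sapply(linkages, function(m) {
  hc <- hclust(d, method = m)
  cor(as.vector(d), as.vector(cophenetic(hc)))
})

coph_cor
```

A higher value means the tree distorts the input dissimilarities less; it does not by itself tell you how many clusters to keep.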
Below is a short version of what I used in my case.
You did not provide details on your input data, so it is not possible to post exact code here. But the following are the general steps you can follow to cluster your samples.
1- Making mutation count/binary matrix: In my case, I am dealing with TCGA data, so there are MAF files, which can be converted to a matrix with the mutCountMatrix function of the maftools package. This provides a count matrix; you may need to convert it to binary (0, 1) coding.
library(maftools)
# maf is a MAF object (e.g. created with read.maf) holding the mutation information
mtx <- mutCountMatrix(maf, includeSyn = FALSE, countOnly = NULL, removeNonMutated = FALSE)
# transpose mtx to have genes in columns and samples in rows
mtx <- t(mtx)
# convert counts to binary: 0 = not mutated, 1 = mutated
mtx.b <- apply(mtx, 2, function(x) ifelse(x > 0, 1, x))
2- Making dissimilarity matrix:
# Gower dissimilarity from the cluster package
library(cluster)
gower <- daisy(mtx.b, metric = "gower")
# binomial dissimilarity from the vegan package
library(vegan)
binomial <- vegdist(mtx.b, binary = TRUE, method = "binomial")
3- Applying clustering (most common agglomerative hierarchical clustering) and plotting
# Gower
gower.aggl.clust <- hclust(gower, method = "complete")
plot(gower.aggl.clust, cex = 0.6, main = "Agglomerative, complete linkage (Gower)")
# binomial
binom.aggl.clust <- hclust(binomial) # hclust uses complete linkage by default
plot(binom.aggl.clust, cex = 0.6, main = "Agglomerative, complete linkage (binomial)")
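Once a dendrogram looks reasonable, samples can be assigned to groups by cutting the tree. Below is a minimal sketch with base R's cutree on made-up data; the toy matrix, the base R "binary" distance and k = 2 are assumptions for illustration. With the objects above you would pass gower.aggl.clust or binom.aggl.clust instead of the toy tree.

```r
# Made-up binary matrix: 6 samples (rows) x 4 genes (columns)
toy <- matrix(c(1, 1, 0, 0,
                1, 1, 1, 0,
                1, 0, 0, 0,
                0, 0, 1, 1,
                0, 1, 1, 1,
                0, 0, 0, 1),
              nrow = 6, byrow = TRUE,
              dimnames = list(paste0("S", 1:6), paste0("gene", 1:4)))

d  <- dist(toy, method = "binary")     # Jaccard-type dissimilarity
hc <- hclust(d, method = "complete")   # agglomerative, complete linkage

# Cut the dendrogram into k groups; returns a named vector sample -> cluster id
groups <- cutree(hc, k = 2)

table(groups)                  # cluster sizes
split(names(groups), groups)   # sample names per cluster
```

The number of groups is a choice you have to justify yourself; cutree(hc, h = ...) cuts at a given height instead of a fixed k.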
There are a lot of details one should be aware of when clustering, for instance how to evaluate the clustering result, and so on. Here I have tried to provide some practical hints on clustering mutational data.
Any comments, modifications or elaborations on this answer are appreciated.
Try Affinity Propagation, it's basically magic.
Clustering is about grouping items by similarity/proximity. You need to define what similarity/proximity is relevant in your case, i.e. what should items in the same cluster share that would differentiate them from another cluster. This helps in selecting the similarity measure used for clustering. Then the selection of clustering algorithm can be dependent on some knowledge/assumption about the cluster structure.
Please include sample data. What do you expect the result to be, and what was your result when you ran your analysis? With that information we can help improve your analysis.