Question

Self-Learning Gene-Expression K-Means Clustering In R

12


14.5 years ago

Eric Normandeau 11k

Hi,

I want to cluster gene expression in R using kmeans (or some other function/package), and I would like the clustering to be 'intelligent', in the sense that some within-cluster dissimilarity metric is minimized while over-splitting of clusters is avoided.

I have already tried kmeans, but I do not want to specify an a priori number of clusters. Here is the code:

data.xpr = read.table("my_data.txt") # Rows = 250 genes, cols = 32 individuals
clusters = kmeans(x = data.xpr, centers=20)

I am well aware that there are a few other questions on the subject, but the answers are very broad and none lets me do what I would like to accomplish.

I would very much appreciate some code examples for R.

Cheers!

r clustering code • 19k views

ADD COMMENT • link updated 6.3 years ago by Ram 44k • written 14.5 years ago by Eric Normandeau 11k

Ram · Answer 1 · 2010-06-18

Of course there is more to it, but you will get more experience by reading about and trying out different methods on your data.

All clustering methods try to optimize some objective function based on dissimilarities, so that fact alone does not help you decide for or against an algorithm. Some algorithms require you to provide an estimate of the number of clusters present. Hierarchical clustering, on the other hand, does not. It is available in R through hclust() or the amap package. Visualizations such as heatmap() or heatmap.2() (from the gplots package) can also be useful. My tip: try hclust with Ward's linkage, too.
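As a minimal sketch of the hclust suggestion above, the following uses base R only. The expression matrix is simulated (the names `profiles`, `expr`, and the cluster count of 3 are illustrative assumptions, not values from the question), so substitute your own data where `expr` is built.

```r
# Simulated stand-in for an expression matrix: 60 genes x 12 samples,
# built from three distinct profiles plus noise (purely illustrative).
set.seed(1)
profiles <- rbind(sin(1:12), cos(1:12), seq(-1, 1, length.out = 12))
expr <- profiles[rep(1:3, each = 20), ] +
  matrix(rnorm(60 * 12, sd = 0.2), nrow = 60)
rownames(expr) <- paste0("gene", 1:60)

d <- dist(expr)                         # Euclidean distances between genes
tree <- hclust(d, method = "ward.D2")   # Ward's minimum-variance linkage

# The tree itself needs no cluster count; a count (or a height) is only
# needed at the moment you cut it into groups.
groups <- cutree(tree, k = 3)
table(groups)
```

Calling plot(tree) shows the dendrogram, which can help when deciding where to cut.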

Model-based clustering is implemented in the mclust package (function Mclust()). I found it very useful for microarray data. It is "intelligent" in your sense in that it tries to guess optimal parameters by optimizing an information criterion (BIC). From the manual:

Model-based clustering (model and number of clusters selected via BIC).

Normal mixture modeling via EM for ten covariance structures.

Simulation from parameterized Gaussian mixtures.

Discriminant analysis via MclustDA.

Model-based hierarchical clustering for four covariance structures.

Displays, including uncertainty plots and random projections.
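The BIC-driven selection described in the manual excerpt can be sketched as follows, assuming the mclust package is installed. The data are simulated and the candidate range `G = 1:6` is an arbitrary illustrative choice; with real expression data you would typically pass the matrix itself or a few principal components.

```r
# Simulated data: three groups in two dimensions (illustrative only).
set.seed(2)
x <- rbind(
  matrix(rnorm(80, mean = 0), ncol = 2),
  matrix(rnorm(80, mean = 5), ncol = 2),
  cbind(rnorm(40, mean = 0), rnorm(40, mean = 8))
)

if (requireNamespace("mclust", quietly = TRUE)) {
  # mclust needs to be attached, not just called via mclust::
  suppressPackageStartupMessages(library(mclust))

  # Fit Gaussian mixtures with 1 to 6 components; the covariance
  # structure and the number of components are both chosen by BIC.
  fit <- Mclust(x, G = 1:6)

  print(fit$G)                     # number of clusters selected
  print(fit$modelName)             # covariance structure selected
  print(table(fit$classification)) # cluster sizes
}
```

The selected number of components depends on the data and on the range given in `G`, so treat it as a suggestion to inspect rather than a final answer.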

Recommendation: try many different methods:

PCA
Discriminant analysis if you have annotation data
hierarchical clustering
Model-based clustering
Self-organizing maps (CRAN package som)

Some methods for assessing clusters were discussed in this question

Hope this gives you some hints on how to proceed.

Ram · Answer 2 · 2010-07-13

Hi,

If you want to run the k-means partitioning algorithm on gene expression data, I think you would do better to use the Kmeans function from the amap package. The default kmeans function uses Euclidean distance as its dissimilarity metric, which is probably not the right choice (though that may depend on your needs...). A better solution could be:

library(amap)
data.xpr = read.table("my_data.txt") # Rows = 250 genes, cols = 32 individuals
clusters = Kmeans(x = data.xpr, centers=20, method="pearson")
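For readers without amap, a related idea can be sketched in base R: if each gene (row) is centred and scaled, the squared Euclidean distance between two genes equals 2(n - 1)(1 - r), where r is their Pearson correlation and n the number of samples. Running the standard kmeans on the standardised rows therefore groups genes by correlation. This is an approximation of the idea, not a reimplementation of amap's Kmeans, and the simulated matrix and `centers = 4` are illustrative assumptions.

```r
set.seed(3)
expr <- matrix(rnorm(50 * 10), nrow = 50)   # 50 genes x 10 samples (simulated)

# Centre and scale each gene (row); scale() works on columns, hence t().
z <- t(scale(t(expr)))
n <- ncol(expr)

# Check the identity: squared Euclidean distance on standardised rows
# equals 2 * (n - 1) * (1 - Pearson correlation).
d2 <- as.matrix(dist(z))^2
r  <- cor(t(expr))
identity_holds <- isTRUE(all.equal(d2, 2 * (n - 1) * (1 - r),
                                   check.attributes = FALSE))

# Standard k-means on the standardised profiles.
clusters <- kmeans(z, centers = 4, nstart = 20)
table(clusters$cluster)
```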

Alternatively, you can use the DBFMCL algorithm (on a Linux OS, as it requires an MCL installation). It is implemented in the RTools4TB BioC package. Note that it has to be run on unfiltered datasets, as it implements a filtering step based on density.

library(RTools4TB)
data.xpr <- read.table("my_data.txt") # The full, unfiltered dataset.
results  <- DBFMCL(data = data.xpr, distance.method = "pearson")
plotGeneExpProfiles(results, sign = 1)

Ram · Answer 3 · 2010-06-18

3


14.5 years ago

Michael Kuhn 5.0k

I think you're looking for some kind of "figure of merit" calculation. MeV has implemented this, and is a nice package for interactive clustering of gene expression data. For R, the clValid package might do what you need.

ADD COMMENT • link updated 6.3 years ago by Ram 44k • written 14.5 years ago by Michael Kuhn 5.0k

score 2 · Answer 4 · 2017-10-25

2


7.2 years ago

Kevin Blighe 88k

You could use my parallelised implementation of clusGap, which computes the gap statistic for a given dataset via PAM or k-means (or a custom metric): R functions edited for parallel processing
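For reference, here is a minimal sketch of the gap statistic using the standard (non-parallel) clusGap from the cluster package, not the parallelised version mentioned above. The simulated data, `K.max = 6`, and `B = 50` are illustrative assumptions; the point is that the number of clusters is estimated rather than fixed in advance.

```r
library(cluster)

# Simulated data: three groups of 30 observations in 4 dimensions.
set.seed(4)
centres <- rbind(c(0, 0, 0, 0), c(5, 5, 0, 0), c(0, 5, 5, 5))
x <- centres[rep(1:3, each = 30), ] + matrix(rnorm(90 * 4), nrow = 90)

# Gap statistic for k = 1..6 using k-means; B is the number of
# reference (bootstrap) datasets, nstart is passed on to kmeans().
gap <- clusGap(x, FUNcluster = kmeans, K.max = 6, B = 50,
               nstart = 10, verbose = FALSE)

# Pick k with the rule of Tibshirani et al. (2001).
k_hat <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"],
               method = "Tibs2001SEmax")
k_hat

# Final clustering with the estimated number of clusters.
final <- kmeans(x, centers = k_hat, nstart = 10)
```

Replacing `FUNcluster = kmeans` with `pam1 <- function(x, k) list(cluster = pam(x, k, cluster.only = TRUE))` gives the PAM variant; plot(gap) shows the gap curve with its standard errors.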

ADD COMMENT • link 7.2 years ago by Kevin Blighe 88k

score 0 · Answer 5 · 2010-06-18

0


14.5 years ago

Will 4.6k

While I'm not sure which R function provides it, you're probably looking for "Chinese Restaurant Process" clustering.

I'm pretty sure there's an R library, but it's been a while since I've had a chance to look for it. I'm sure some googling would find it.

ADD COMMENT • link 14.5 years ago by Will 4.6k

2


Will, he was looking for cluster analysis, not for a stochastic process. -1 for a "random-pick" Wikipedia link...

ADD REPLY • link 14.5 years ago by Michael 55k