Question

Hierarchical clustering of genes: identify number of clusters

0

6.9 years ago

hodayabeer ▴ 10

Hi all,

I have a data set where the rows are genes and columns are phylogenetic profiling scores.

I clustered this dataset of genes with hierarchical clustering in R and got a dendrogram from the hclust() output. I need to identify the number of clusters so that genes in the same cluster are very similar to each other according to the column values (genes that have similar values in the same columns belong to the same cluster), and basically split the data into modules. I need a systematic way to do this on many datasets at once, without human involvement or manual optimization.

I used the function NbClust(), but its output was not good enough: some genes appear in the same cluster although they are not similar enough :(

I would really appreciate a suggestion for an R function that removes genes that are not related to their cluster, or a better function for determining the best number of clusters that allows for the possibility of leaving some genes out.

Thank you!

gene hierarchial clustering cluster R • 1.6k views

ADD COMMENT • link updated 6.9 years ago by Kevin Blighe 88k • written 6.9 years ago by hodayabeer ▴ 10

score 1 · Answer 1 · 2018-01-08

1

6.9 years ago

Kevin Blighe 88k

For extracting information from the clustering, take a look at my answer here: A: extract dendrogram cluster from pheatmap. This is a very crude way of deciding the ideal number of clusters, though, because you, the human, decide manually where to cut the tree. However, if you cluster using correlation distance as the dissimilarity, then you can easily say that you identified cluster groups based on Pearson correlation > 0.9, for example.
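As a minimal sketch of the correlation-threshold idea (not the code from the linked answer), one could cluster on 1 minus the Pearson correlation and cut the tree at height 1 - 0.9. The simulated data and the names `expr`, `threshold` and `singletons` are assumptions for illustration only. With complete linkage, every pair of genes inside a resulting cluster has a correlation of at least the threshold; other linkage methods do not give that guarantee.

```r
# Simulated example data: 30 genes (rows) x 12 profile scores (columns).
set.seed(1)
base1 <- rnorm(12)
base2 <- rnorm(12)
expr <- rbind(
  t(replicate(10, base1 + rnorm(12, sd = 0.1))),  # module 1
  t(replicate(10, base2 + rnorm(12, sd = 0.1))),  # module 2
  matrix(rnorm(10 * 12), nrow = 10)               # unrelated genes
)
rownames(expr) <- paste0("gene", seq_len(nrow(expr)))

threshold <- 0.9  # assumed minimum within-cluster Pearson correlation

# cor() correlates columns, so transpose to correlate genes (rows).
gene_cor <- cor(t(expr), method = "pearson")
gene_dist <- as.dist(1 - gene_cor)

hc <- hclust(gene_dist, method = "complete")
clusters <- cutree(hc, h = 1 - threshold)

# Genes that end up alone in a cluster are not tightly related to any module.
cluster_sizes <- tabulate(clusters)
singletons <- names(clusters)[cluster_sizes[clusters] == 1]
```

Because the cut height is fixed in advance, no human has to look at the dendrogram, and genes that do not correlate strongly with anything simply fall out as single-gene clusters.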

Other ways of deciding the ideal number of clusters in a dataset include, but are not limited to:

Silhouette method
Elbow method
Gap statistic
Consensus clustering

All of these have implementations in R.
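As one hedged example of these, the silhouette method can be automated with silhouette() from the cluster package (a recommended package that normally ships with R). This is only a sketch on simulated data: the names `expr`, `candidate_k`, `best_k` and `poor_fit`, the Euclidean distance, and the average linkage are all assumptions, not a recommendation for any particular dataset.

```r
library(cluster)  # provides silhouette()

set.seed(42)
# Simulated data: three groups of 15 genes, 8 profile scores each.
centers <- matrix(rnorm(3 * 8, sd = 3), nrow = 3)
expr <- centers[rep(1:3, each = 15), ] +
  matrix(rnorm(45 * 8, sd = 0.5), nrow = 45)
rownames(expr) <- paste0("gene", seq_len(nrow(expr)))

gene_dist <- dist(expr)  # Euclidean distance between genes
hc <- hclust(gene_dist, method = "average")

# Mean silhouette width for each candidate k (undefined for k = 1).
candidate_k <- 2:10
mean_sil <- sapply(candidate_k, function(k) {
  sil <- silhouette(cutree(hc, k = k), gene_dist)
  mean(sil[, "sil_width"])
})
best_k <- candidate_k[which.max(mean_sil)]

# Per-gene silhouette widths at the chosen k; negative values flag genes
# that fit their assigned cluster poorly.
sil_best <- silhouette(cutree(hc, k = best_k), gene_dist)
poor_fit <- rownames(expr)[sil_best[, "sil_width"] < 0]
```

Picking the k with the largest mean silhouette width needs no manual inspection, and the per-gene widths give one possible criterion for leaving poorly fitting genes out of a module.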

I published on this recently in the context of asthma and vitamin D: Vitamin D prenatal programming of childhood metabolomics profiles at age 3 y.

ADD COMMENT • link 6.9 years ago by Kevin Blighe 88k

0

Thank you for your answer. What I am trying to do is write code that will identify the number of clusters in many datasets simultaneously. Therefore, I am trying to find a systematic way to do that, without human involvement. So how can I apply your suggestion in R code, and let the algorithm determine the 'h' or the 'k' in cutree() based on the Pearson correlation?

ADD REPLY • link 6.9 years ago by hodayabeer ▴ 10

0

Whilst automation is good, you should never completely disengage from the computer. There are instances, across various industries, where automated processes fail us, sometimes with fatal consequences.

To do what you want, just set up a loop over the datasets and write the results to a simple text file or to the terminal, so that you can then screen them. If you want ideas for loops, look at my code here: R functions edited for parallel processing

Note that you can save object names in a vector and then call them one-by-one in a loop:

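# Three example matrices of random data (5 rows x 10 columns each)
mat1 <- matrix(rexp(50, rate=0.1), ncol=10)
mat2 <- matrix(rexp(50, rate=0.1), ncol=10)
mat3 <- matrix(rexp(50, rate=0.1), ncol=10)

# Store the object names (not the objects themselves) in a character vector
MyDataMatrices <- c("mat1", "mat2", "mat3")

for (i in seq_along(MyDataMatrices))
{
       # Look up the matrix object by its name
       currentDataMatrix <- get(MyDataMatrices[i])
       # ...
       # [do processing on currentDataMatrix]
}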
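For what it is worth, here is a hedged sketch of how such a loop might be filled in, using the Pearson correlation > 0.9 cut mentioned earlier in the thread as the processing step. It keeps the matrices in a named list instead of using get(). The helper `count_clusters`, the threshold, and the output file name are hypothetical choices for illustration.

```r
set.seed(123)
datasets <- list(
  mat1 = matrix(rexp(50, rate = 0.1), ncol = 10),
  mat2 = matrix(rexp(50, rate = 0.1), ncol = 10),
  mat3 = matrix(rexp(50, rate = 0.1), ncol = 10)
)

# Hypothetical helper: number of clusters obtained when the tree is cut so
# that within-cluster Pearson correlation is at least `threshold`.
count_clusters <- function(mat, threshold = 0.9) {
  hc <- hclust(as.dist(1 - cor(t(mat))), method = "complete")
  length(unique(cutree(hc, h = 1 - threshold)))
}

# One summary row per dataset.
results <- data.frame(
  dataset = names(datasets),
  n_rows = vapply(datasets, nrow, integer(1)),
  n_clusters = vapply(datasets, count_clusters, integer(1)),
  row.names = NULL
)

# Write a plain-text summary that can be screened by eye afterwards.
out_file <- file.path(tempdir(), "cluster_summary.txt")
write.table(results, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
print(results)
```

The summary table is deliberately small so that a human can still glance over the results for every dataset after the automated run.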

ADD REPLY • link 6.9 years ago by Kevin Blighe 88k