Hi all,
I have a data set where the rows are genes and columns are phylogenetic profiling scores.
I clustered the genes with hierarchical clustering in R and plotted a dendrogram of the hclust() output. I now need to identify the number of clusters, so that genes in the same cluster are very similar to each other across the columns (genes with similar values in the same columns belong to the same cluster), and in that way split the data into modules. I need a systematic way to do this on many datasets at once, without manual intervention or tuning.
I tried the function NbClust(), but its output was not satisfactory: some genes end up in the same cluster even though they are not similar enough :(
I would really appreciate a suggestion for an R function that removes genes that do not belong in their cluster, or a better function for determining the best number of clusters that allows some genes to be left out.
Thank you!
Thank you for your answer. What I am trying to do is write code that identifies the number of clusters in many datasets simultaneously. I am therefore looking for a systematic way to do this, without human involvement. So how can I apply your suggestion in R code, and let the algorithm determine the 'h' or the 'k' in cutree() based on the Pearson correlation?
Whilst automation is good, you should never completely disengage from the computer. Automated processes do fail us, across various industries, and sometimes with fatal consequences.
To do what you want, just set up a loop over the datasets and write the results to a simple text file or to the terminal, so that you can then screen them. If you want ideas for loops, look at my code here: R functions edited for parallel processing
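As a rough sketch of what the body of such a loop could do, assuming each dataset is a numeric matrix with genes in rows and no constant rows: the helper `cluster_by_correlation()` below is hypothetical, and `min_cor = 0.8` and `min_size = 2` are arbitrary illustrative values, not recommendations. It cuts the tree at a Pearson-correlation threshold instead of a fixed `k`, and marks genes in undersized clusters as unassigned (cluster 0). The toy matrix and the output file name are made up.

```r
# Hypothetical helper: cluster genes (rows) on 1 - Pearson correlation and
# cut the tree at a correlation threshold rather than at a fixed k.
cluster_by_correlation <- function(mat, min_cor = 0.8, min_size = 2) {
  # Distance between gene profiles; assumes no row has zero variance
  d <- as.dist(1 - cor(t(mat), method = "pearson"))
  # Complete linkage: the merge height is the largest pairwise distance,
  # so cutting at 1 - min_cor keeps every within-cluster pair >= min_cor
  hc <- hclust(d, method = "complete")
  cl <- cutree(hc, h = 1 - min_cor)
  # Genes in clusters smaller than min_size are marked unassigned (0)
  sizes <- table(cl)
  small <- as.integer(names(sizes)[sizes < min_size])
  cl[cl %in% small] <- 0L
  cl
}

# Toy matrix (made up): two pairs of co-varying genes and one outlier
x <- 1:10
mat <- rbind(
  g1 = x,
  g2 = 2 * x + 1,
  g3 = rev(x),
  g4 = 3 * rev(x),
  g5 = rep(c(1, -1), 5)
)

cl <- cluster_by_correlation(mat)
result <- data.frame(gene = names(cl), cluster = unname(cl))
print(result)

# Also write the table to a text file for later screening
out_file <- file.path(tempdir(), "clusters_toy.txt")
write.table(result, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

The threshold still has to be chosen by you once, but it is then applied identically to every dataset, and the unassigned genes give you something concrete to check by eye.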
Note that you can save object names in a vector and then call them one-by-one in a loop:
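A minimal sketch of that idea, using two made-up matrices (`dataset_a`, `dataset_b`) and a placeholder `hclust()` call standing in for whatever analysis you actually run:

```r
# Toy datasets (made up); in practice these would be your own matrices
set.seed(42)
dataset_a <- matrix(rnorm(50), nrow = 10)
dataset_b <- matrix(rnorm(80), nrow = 20)

# Store the object names, not the objects themselves
dataset_names <- c("dataset_a", "dataset_b")

n_genes <- integer(0)
for (nm in dataset_names) {
  mat <- get(nm)             # look the object up by its name
  hc <- hclust(dist(mat))    # placeholder analysis step
  n_genes[nm] <- length(hc$order)
  cat(nm, ":", nrow(mat), "rows clustered\n")
}
```

Keeping the matrices in a named list and looping over that with `lapply()` does the same job and is often tidier than `get()`.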