I am trying to cut a dendrogram using the dynamicTreeCut package, since I prefer dynamic cutting and clustering. I ran the code below:
clusDyn <- cutreeDynamic(hr, distM = as.matrix(as.dist(1-cor(t(scaledata)))), method = "hybrid")
However, it produces 160 clusters, which is too many to analyze individually. Is it possible to cut the tree dynamically but also group the results so that only a specific number of clusters is produced? For example, I would like 20 clusters after the dynamic tree cut instead of 160.
I know that if I cut the dendrogram at a specific height I could control the number of clusters it generates, but I prefer dynamic tree cutting.
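For reference, the static alternative mentioned above can be sketched in base R with `cutree()`, which accepts either a target number of clusters (`k`) or a cut height (`h`). This is only an illustration: `scaledata` is simulated here as a stand-in for the real scaled expression matrix, and the height of 1.5 is an arbitrary value chosen for the example.

```r
set.seed(1)

# Simulated stand-in for the scaled expression matrix (genes in rows);
# an assumption for illustration, not the poster's data.
scaledata <- matrix(rnorm(200 * 10), nrow = 200)

# Same distance and linkage as used in this thread
hr <- hclust(as.dist(1 - cor(t(scaledata), method = "pearson")),
             method = "complete")

# Static cut by number of clusters: always returns exactly k groups
clusStatic <- cutree(hr, k = 20)
table(clusStatic)

# Static cut by height: the number of groups depends on the tree
clusByHeight <- cutree(hr, h = 1.5)
length(unique(clusByHeight))
```

Unlike a dynamic cut, `cutree(hr, k = 20)` ignores the shape of the branches, which is the trade-off being weighed in the question.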
This is happening because the input is a simple correlation matrix that is affected by spurious or missing connections (see this paper).
I am very new to RNA-seq analysis and clustering. Could you please elaborate? Do you mean that Pearson correlation is not enough for this clustering and that I should look for other methods? Is WGCNA a better workflow?
Help me to understand. Is this a clustering analysis of differentially expressed genes or an unsupervised clustering analysis (e.g., WGCNA)?
These are differentially expressed genes, around 15K out of a total of 30K genes. I then followed the clustering protocol given at this link (the genes are scaled and then clustered by Pearson correlation): https://2-bitbio.com/2017/04/clustering-rnaseq-data-making-heatmaps.html
I don't think the cutreeDynamic function will work very well with a distance matrix calculated from Pearson correlation values: as.matrix(as.dist(1-cor(t(scaledata)))). Just to be sure, how did you calculate hr (the link doesn't work for me)?
Thank you for the effort. I did calculate hr as you have shown:
hr <- hclust(as.dist(1-cor(t(scaledata), method="pearson")), method="complete")
As it seems that Pearson correlation values do not work well with cutreeDynamic, can you please suggest something that I can look into, to make a better correlation matrix?
Look, I am not familiar with workflows used for the detection of clusters of differentially expressed genes. What I can tell you is that cutreeDynamic, with the default settings, doesn't work very well when the distance matrix is calculated just from Pearson correlation values.
If you want to use cutreeDynamic, there are settings that you can change in order to reduce the number of clusters. For example, see: minClusterSize, deepSplit, cutHeight, and maxCoreScatter (usage)
Hi #andres.firrincieli,
Although it's late, I hope to get your helpful answer. Regarding cutreeDynamic, you recommended changing some settings such as cutHeight. So do we have to determine a cutHeight value even with cutreeDynamic? My understanding was that we do not need to specify the cutHeight parameter explicitly for cutreeDynamic. Is that not correct?
I typically set the minimum cluster size to 100 and leave the others with the default settings.
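As a rough sketch of what changing those settings looks like, the example below runs cutreeDynamic twice on the same tree: once with the defaults and once with a larger minClusterSize and a lower deepSplit, leaving cutHeight unspecified in both calls. The data are simulated with four planted groups purely for illustration (an assumption, not the poster's data), and the values 100 and 0 are examples rather than recommendations. The number of clusters you get on real data will differ.

```r
library(dynamicTreeCut)

set.seed(1)

# Simulated stand-in for the scaled expression matrix: 4 planted groups
# of 150 genes each across 12 samples (hypothetical data for illustration).
n_groups <- 4; genes_per_group <- 150; n_samples <- 12
patterns <- matrix(rnorm(n_groups * n_samples), nrow = n_groups)
expr <- patterns[rep(seq_len(n_groups), each = genes_per_group), ] +
  matrix(rnorm(n_groups * genes_per_group * n_samples, sd = 0.5),
         ncol = n_samples)
scaledata <- t(scale(t(expr)))  # scale each gene (row)

# Same distance and linkage as used in this thread
d  <- as.dist(1 - cor(t(scaledata), method = "pearson"))
hr <- hclust(d, method = "complete")

# Default settings; cutHeight is left unspecified
clusDefault <- cutreeDynamic(hr, distM = as.matrix(d), method = "hybrid",
                             verbose = 0)

# Coarser settings: larger minimum cluster size, least sensitive splitting
clusCoarse <- cutreeDynamic(hr, distM = as.matrix(d), method = "hybrid",
                            minClusterSize = 100, deepSplit = 0,
                            verbose = 0)

# Compare cluster counts; label 0 marks genes left unassigned
table(clusDefault)
table(clusCoarse)
```

None of these settings fixes the number of clusters at an exact value such as 20. They only make the cut coarser or finer, so some trial and error with minClusterSize and deepSplit is usually needed.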