I am stuck on a problem with hierarchical clustering. I want to make a dendrogram and a heatmap, using a correlation-based distance (d_mydata = dist(1 - cor(t(mydata)))) and ward.D2 as the clustering method.
As a convenience, the pheatmap package can plot the dendrogram on the left side of the heatmap to visualize the clusters.
The pipeline of my analysis would be this:
1) create the dendrogram
2) test how many clusters would be optimal (k)
3) extract the subjects in each cluster
4) create a heatmap
To my surprise, the dendrogram plotted in the heatmap is not the same as the one plotted before, even though the methods are the same.
So I decided to create a pheatmap coloured by the clusters previously assigned by cutree, and to check whether the colours correspond to the clusters in the dendrogram.
This is my code:
Create test matrix
test = matrix(rnorm(200), 20, 10)
test[1:10, seq(1, 10, 2)] = test[1:10, seq(1, 10, 2)] + 3
test[11:20, seq(2, 10, 2)] = test[11:20, seq(2, 10, 2)] + 2
test[15:20, seq(2, 10, 2)] = test[15:20, seq(2, 10, 2)] + 4
colnames(test) = paste("Test", 1:10, sep = "")
rownames(test) = paste("Gene", 1:20, sep = "")
test <- as.data.frame(test)
Create a dendrogram with this test matrix
dist_test <- dist(test)
hc = hclust(dist_test, method = "ward.D2")
plot(hc)
dend <- as.dendrogram(hc, check = F, nodePar = list(cex = .000007), leaflab = "none", cex.main = 3, axes = F, adjust = F)
clus2 <- as.factor(cutree(hc, k = 2)) # cut tree into 2 clusters
groups <- data.frame(clus2)
groups$id <- rownames(groups)
-----------DATAFRAME WITH mydata AND THE CLASSIFICATION OF CLUSTERS AS FACTORS---------------------
test$id <- rownames(test)
clusters <- merge(groups, test, by.x = "id")
rownames(clusters) <- clusters$id
clusters$clus2 <- as.character(clusters$clus2)
clusters$clus2[clusters$clus2 == "1"] = "cluster1"
clusters$clus2[clusters$clus2 == "2"] <- "cluster2"
plot(dend, main = "test", horiz = TRUE, leaflab = "none")
d_clusters <- dist(1 - cor(t(clusters[, 7:10])))
hc_cl = hclust(d_clusters, method = "ward.D2")
annotation_col = data.frame(
  Path = factor(colnames(clusters[3:12]))
)
rownames(annotation_col) = colnames(clusters[3:12])
annotation_row = data.frame(
  Group = factor(clusters$clus2)
)
rownames(annotation_row) = rownames(clusters)
Specify colors
ann_colors = list(
  Path = c(Test1 = "darkseagreen", Test2 = "lavenderblush2", Test3 = "lightcyan3", Test4 = "mediumpurple", Test5 = "red", Test6 = "blue", Test7 = "brown", Test8 = "pink", Test9 = "black", Test10 = "grey"),
  Group = c(cluster1 = "yellow", cluster2 = "blue")
)
require(RColorBrewer)
library(RColorBrewer)
cols <- colorRampPalette(brewer.pal(10, "RdYlBu"))(20)
library(pheatmap)
pheatmap(clusters[, 3:12],
  color = rev(cols),
  scale = "column",
  kmeans_k = NA,
  show_rownames = F,
  show_colnames = T,
  main = "Heatmap CK14, CK5/6, GATA3 and FOXA1 n=492 SCALE",
  clustering_method = "ward.D2",
  cluster_rows = TRUE,
  cluster_cols = TRUE,
  clustering_distance_rows = "correlation",
  clustering_distance_cols = "correlation",
  annotation_row = annotation_row,
  annotation_col = annotation_col,
  annotation_colors = ann_colors
)
You are not scaling your data when you call hclust(dist(data)), but in pheatmap you scale it by column (scale = "column").
The pheatmap help page says it uses hclust, so I think the mismatch comes from not giving both the same input. pheatmap also returns the row and column trees it computed (tree_row and tree_col), so check: 1) whether your distance matrix and tree match pheatmap's; 2) that you scale your data the same way outside pheatmap.
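To illustrate the scaling point, here is a minimal base-R sketch (not the original poster's exact data: the seed and the reduced toy matrix are my own assumptions). It clusters the same matrix once raw and once after column scaling with scale(), then cross-tabulates the two 2-cluster assignments so any disagreement is visible.

```r
set.seed(1)  # hypothetical seed, only so the sketch is reproducible

# Toy matrix in the spirit of the question's example
test <- matrix(rnorm(200), 20, 10)
test[1:10, seq(1, 10, 2)] <- test[1:10, seq(1, 10, 2)] + 3
test[11:20, seq(2, 10, 2)] <- test[11:20, seq(2, 10, 2)] + 2
colnames(test) <- paste0("Test", 1:10)
rownames(test) <- paste0("Gene", 1:20)

# Clustering on the raw data, as in hclust(dist(data))
hc_raw <- hclust(dist(test), method = "ward.D2")

# Clustering on column-scaled data, mirroring scale = "column"
test_scaled <- scale(test)  # centers each column and divides by its sd
hc_scaled <- hclust(dist(test_scaled), method = "ward.D2")

# Are the two distance inputs the same? (TRUE would mean scaling changed nothing)
same_input <- isTRUE(all.equal(as.vector(dist(test)),
                               as.vector(dist(test_scaled))))
print(same_input)

# Cross-tabulate the 2-cluster assignments from both trees
print(table(raw = cutree(hc_raw, k = 2), scaled = cutree(hc_scaled, k = 2)))
```

If pheatmap is available, the tree_row element of its return value could be compared against hc_scaled in the same way, provided the distance and clustering method are also matched.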
I assume you have to change
with something like
to use correlation distance.
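A rough sketch of what such a change could look like, in base R only. It assumes (from my reading of the pheatmap source, so treat it as an assumption) that the "correlation" option uses as.dist(1 - cor(t(mat))) on the already scaled matrix. Note that dist(1 - cor(...)) is a different quantity: it takes Euclidean distances between the rows of the (1 - cor) matrix instead of using 1 - cor itself as the dissimilarity. The seed and toy data are hypothetical.

```r
set.seed(1)  # hypothetical seed, only so the sketch is reproducible

test <- matrix(rnorm(200), 20, 10)
colnames(test) <- paste0("Test", 1:10)
rownames(test) <- paste0("Gene", 1:20)

# Scale by column first, to mirror scale = "column" in the pheatmap call
test_scaled <- scale(test)

# Correlation distance between rows: 1 - Pearson correlation,
# used directly as the dissimilarity
d_cor <- as.dist(1 - cor(t(test_scaled)))

# For comparison: Euclidean distance between rows of the (1 - cor) matrix
d_euc_of_cor <- dist(1 - cor(t(test_scaled)))

# Row tree and 2-cluster cut based on the correlation distance
hc_cor <- hclust(d_cor, method = "ward.D2")
clus2 <- cutree(hc_cor, k = 2)

# The two "correlation" constructions are not the same distances
differ <- !isTRUE(all.equal(as.vector(d_cor), as.vector(d_euc_of_cor)))
print(differ)
```

Clusters cut from hc_cor can then be used for the row annotation, so that the colours are derived from the same kind of tree the heatmap draws.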