Hi all,
I have seen two widespread practices when dealing with correlation heatmaps and I am not sure which is best, if such a concept even applies here. Since I use the pheatmap R package, I will use it in my examples:
1) Give a correlation matrix to the drawing function, which then calculates distances between the rows and columns of that matrix using its default clustering settings. The output is a heatmap with a scale between -1 and 1, where 1 indicates maximum similarity. In pheatmap, that would be as follows:
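library(pheatmap)
cor.matrix <- cor(data)  # Pearson correlation between the columns of data
pheatmap(cor.matrix)     # clustering settings left at their defaults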
For instance, one can observe this behaviour here: [1], [2]. And although I am not sure they were generated in this way too, many papers present a correlation heatmap with a scale that suggests they are clustering the correlation values rather than a distance between samples: [3], [4].
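To make explicit what option 1 clusters on, here is a minimal sketch with made-up data. The `toy` matrix and all object names are hypothetical, and the explicit `hclust()` call is only my understanding of what the defaults quoted from [4] further down (Euclidean distance, complete linkage) amount to, not something taken from the pheatmap source.

```r
set.seed(1)
# Hypothetical toy data: 20 observations (rows) for 6 samples (columns)
toy <- matrix(rnorm(120), nrow = 20, dimnames = list(NULL, paste0("s", 1:6)))
cor.toy <- cor(toy)  # 6 x 6 Pearson correlation matrix

# Option 1 treats each row of the correlation matrix as a data vector and
# clusters on the Euclidean distances between those rows
d.euclid <- dist(cor.toy, method = "euclidean")
hc.euclid <- hclust(d.euclid, method = "complete")

# The distance between s1 and s2 is therefore NOT 1 - r, but the Euclidean
# distance between their two correlation profiles
manual <- sqrt(sum((cor.toy["s1", ] - cor.toy["s2", ])^2))

# Same settings spelled out in pheatmap (skipped if the package is missing)
if (requireNamespace("pheatmap", quietly = TRUE)) {
  pheatmap::pheatmap(cor.toy,
                     clustering_distance_rows = "euclidean",
                     clustering_distance_cols = "euclidean",
                     clustering_method = "complete")
}
```

So two samples end up close together when their correlations with all the other samples look alike, which is related to, but not the same as, being highly correlated with each other.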
2) Convert correlation to distance and use the distance object both for visualization AND distance clustering. Again, to give an example with pheatmap:
sampleDists <- as.dist(1 - cor.matrix)  # correlation distance; cor.matrix as computed in 1)
pheatmap(as.matrix(sampleDists), clustering_distance_rows=sampleDists,
clustering_distance_cols=sampleDists)
Different tutorials on the internet recommend this, such as [5] or [6]. Besides, this is also the approach recommended by DESeq developers Michael Love and Simon Anders in their RNA-seq workflow, although they use Euclidean distance instead of Pearson correlation ([7]).
To me, the second option makes more sense. As Love and Anders summarize in their paper ([7]): "Otherwise the pheatmap function would assume that the matrix contains the data values themselves, and would calculate distances between the rows/columns of the distance matrix, which is not desired". However, that doesn't seem compatible with the fact that many published papers use the correlation matrix as data values and calculate the Euclidean distances between them. In fact, this one ([4]) states it quite explicitly: "We also plotted the heatmap of the matrix of Pearson correlations between the 26 samples, using the pheatmap function from the pheatmap package v1.0.2 14 with default settings (i.e. complete linkage hierarchical clustering using the Euclidean distances)."
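The point in that quote can be checked numerically. Below is a small sketch in base R with made-up data (the `toy` matrix and all object names are hypothetical): it compares the correlation distance itself with the "distance of distances" you would get if the distance matrix were treated as data.

```r
set.seed(1)
# Hypothetical toy data: 20 observations (rows) for 6 samples (columns)
toy <- matrix(rnorm(120), nrow = 20, dimnames = list(NULL, paste0("s", 1:6)))
cor.toy <- cor(toy)

# Option 2: the correlation distance, used directly for clustering
d.cor <- as.dist(1 - cor.toy)

# What happens if the distance matrix is passed as data with no
# clustering_distance_* arguments: Euclidean distances between its rows
d.of.d <- dist(as.matrix(d.cor))

# Option 1: Euclidean distances between rows of the correlation matrix
d.euclid <- dist(cor.toy)

# The "distance of distances" is not the correlation distance ...
same.as.cor <- isTRUE(all.equal(as.matrix(d.cor), as.matrix(d.of.d)))

# ... but it coincides with option 1, because subtracting every entry from 1
# does not change the differences between rows
same.as.euclid <- isTRUE(all.equal(as.matrix(d.euclid), as.matrix(d.of.d)))

print(c(same.as.cor = same.as.cor, same.as.euclid = same.as.euclid))
```

If I am reading this right, forgetting the clustering_distance_rows/cols arguments in option 2 silently falls back to the same tree as option 1, only with a different colour scale.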
Results do not seem to diverge enormously, though in my dataset I seem to observe somewhat better results with the second approach.
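One way to put a number on how much the two approaches diverge is to compare the resulting trees directly. This is only a sketch on made-up data (the `toy` matrix, the object names and the choice of k = 3 clusters are all arbitrary assumptions), using complete linkage for both trees.

```r
set.seed(42)
# Hypothetical toy data: 25 observations (rows) for 8 samples (columns)
toy <- matrix(rnorm(200), nrow = 25, dimnames = list(NULL, paste0("s", 1:8)))
cor.toy <- cor(toy)

# Approach 1: Euclidean distance between rows of the correlation matrix
hc1 <- hclust(dist(cor.toy), method = "complete")
# Approach 2: correlation distance 1 - r
hc2 <- hclust(as.dist(1 - cor.toy), method = "complete")

# Correlation between the cophenetic distances of the two trees:
# values near 1 mean the two dendrograms tell much the same story
coph <- cor(as.vector(cophenetic(hc1)), as.vector(cophenetic(hc2)))

# Cross-tabulate cluster membership after cutting both trees into k = 3 groups
tab <- table(approach1 = cutree(hc1, k = 3), approach2 = cutree(hc2, k = 3))

print(coph)
print(tab)
```

On real data, replacing `cor.toy` with your own correlation matrix shows whether the two trees differ only in branch heights or actually group the samples differently.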
PS: This question might have been asked before, but I am not very deft with the search feature on this site and I haven't found any convincing answer.
I was also just wondering about this. Which one did you end up using and why? Thanks!