I asked this question on Stack Overflow, but it seems no one can answer it.
As far as I can see the two functions differ only when using Pearson's correlation as a distance. I do not know which one is correct.
I am trying to make pheatmap cluster columns in the same order as aheatmap.
I have looked at both functions, created a small example set, and used the same clustering functions, yet they give different answers.
library(pheatmap)
library(NMF)  # provides aheatmap
set.seed(1234)
testm <- replicate(10, rnorm(20))  # 20 x 10 matrix of standard normal draws
pt <- pheatmap(testm, clustering_distance_rows = "correlation", clustering_distance_cols = "correlation")
at <- aheatmap(testm, Colv = "correlation", Rowv = "correlation", hclustfun = "complete")
Comparing pt$tree_col$order with at$colInd shows that they produce different cluster orderings. What is the difference between the functions, and how do I make pheatmap give the same clustering output as aheatmap?
We can observe the different order by simple visual inspection of the heatmaps.
This is an example for the order of the columns:
hclust is always "complete".
When they both use Pearson's correlation as distance:
aheatmap: 9 8 10 3 2 7 4 6 1 5
pheatmap: 4 6 9 1 5 3 2 7 8 10
When I use Euclidean distance they both give: 9 4 6 1 5 8 10 3 2 7
For maximum distance they both give: 10 7 2 6 9 4 1 5 3 8
No offense, but given that the author of the aheatmap function made two typos in one installation line (intall.pacakges('NMF'), http://renozao.github.io/NMF/master/vignettes/aheatmaps.pdf), I would rather go with pheatmap.
Or go with ComplexHeatmap, which I found to be the most comprehensive package, even though you'll need some time to get your head around its principles, as its plethora of functionality makes it heavy going. Still, a good investment, I think.
I've seen someone ask a similar question: why do these two packages produce slightly different results, and how can I make them agree? There's a lot of discussion regarding pheatmap vs heatmap2.
The question is why do you want to make them agree?
When correlation is selected, they both calculate the distance matrix in the same way:
pheatmap's default linkage method is 'complete', so there is no difference there, either.
The difference likely lies in how the columns are re-ordered. Take a look at reorderfun.
I have somewhat the same sentiment as Amar, though: why do you want them to agree?
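The reordering point can be illustrated with base R alone. The sketch below is not taken from either package: it assumes correlation distance means 1 - Pearson correlation, adds hypothetical column names so leaves can be matched by label, and uses column means as arbitrary reordering weights. It shows that reordering branches can change the leaf order while leaving the tree itself (the merge heights between every pair of columns) untouched.

```r
set.seed(1234)
testm <- replicate(10, rnorm(20))
colnames(testm) <- paste0("c", 1:10)  # hypothetical labels, for matching leaves by name

# Assumed definition of correlation distance between columns: 1 - Pearson correlation
d <- as.dist(1 - cor(testm))
hc <- hclust(d, method = "complete")
dend <- as.dendrogram(hc)

# Rotate branches using arbitrary weights (column means, chosen only for illustration)
dend_reordered <- reorder(dend, wts = colMeans(testm), agglo.FUN = mean)

order_plain <- order.dendrogram(dend)
order_reordered <- order.dendrogram(dend_reordered)

# Cophenetic distances depend only on the tree, not on how its branches are rotated
coph_plain <- as.matrix(cophenetic(dend))
coph_reordered <- as.matrix(cophenetic(dend_reordered))
labs <- rownames(coph_plain)
same_tree <- isTRUE(all.equal(coph_plain, coph_reordered[labs, labs]))

print(order_plain)
print(order_reordered)
print(same_tree)
```

If two heatmaps differ only in this way, their column orders will not match even though the underlying clustering is identical.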
If I change Pearson's correlation to Euclidean distance, then they agree. So the question is which one implements Pearson's correlation as a distance correctly. I doubt reorderfun would be different for different distance measures. I want to use the one that gives the correct answer when using Pearson's correlation as the distance.
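One way to get an independent baseline is to build the column trees by hand with base R and compare their leaf orders with what each package returns. This is a minimal sketch: it assumes correlation distance means 1 - Pearson correlation, which is a common convention but is not confirmed in this thread as what either package does, and the helper ref_order is a hypothetical name.

```r
set.seed(1234)
testm <- replicate(10, rnorm(20))

# Distances between columns. "Correlation distance" is assumed to be
# 1 - Pearson correlation; this is an assumption, not either package's documented behaviour.
d_cor <- as.dist(1 - cor(testm, method = "pearson"))
d_euc <- dist(t(testm), method = "euclidean")
d_max <- dist(t(testm), method = "maximum")

# Hypothetical helper: leaf order of a complete-linkage tree
ref_order <- function(d) hclust(d, method = "complete")$order

ref <- list(
  correlation = ref_order(d_cor),
  euclidean   = ref_order(d_euc),
  maximum     = ref_order(d_max)
)
print(ref)
```

Each element of ref can then be compared with pt$tree_col$order and at$colInd for the matching distance setting. A mismatch in order alone does not prove a different tree, since branches may simply have been rotated.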
I am reasonably sure they both correctly apply the parameters you give them, but you would need to review the code extensively to make sure all parameters are indeed identical. Please take no offense at the following, but I always find it odd that users make claims like "the result is not correct" simply because the output does not fit their straightforward expectations. There is not one correct output, given the many factors that can influence a heatmap. There might be some details in how columns are grouped (as Kevin already pointed out). Please make sure you evaluate all of this before claiming that something is not correct. Again, please take no offense; the above is not aimed specifically at you but at all users who try to sort out unexpected differences between tools.
What exactly is different? Are the major clusters the same, or is it simply the order of the clusters in the visualization?
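To check whether the major clusters agree regardless of display order, the trees can be cut into k groups and the memberships compared up to relabelling. The sketch below uses base R only; same_partition is a hypothetical helper, k = 3 is an arbitrary choice, and the correlation distance is again assumed to be 1 - Pearson correlation.

```r
set.seed(1234)
testm <- replicate(10, rnorm(20))

# Hypothetical helper: TRUE if two membership vectors describe the same grouping,
# ignoring how the clusters happen to be numbered
same_partition <- function(a, b) {
  tab <- table(a, b)
  all(rowSums(tab > 0) == 1) && all(colSums(tab > 0) == 1)
}

hc <- hclust(as.dist(1 - cor(testm)), method = "complete")
memb <- cutree(hc, k = 3)  # k = 3 chosen only for illustration

# Same grouping with permuted cluster labels: still the same partition
relabelled <- c(3, 1, 2)[memb]

# Moving one column into another cluster: a genuinely different partition
moved <- memb
moved[1] <- if (memb[1] == 1) 2 else 1

print(same_partition(memb, relabelled))
print(same_partition(memb, moved))
```

In practice a and b would be the memberships obtained from the two packages' column trees, for example cutree(pt$tree_col, k) for pheatmap, and the equivalent for whatever tree object aheatmap exposes.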
This is an example for the order of the columns:
hclust is always "complete".
When they both use Pearson's correlation as distance:
aheatmap: 9 8 10 3 2 7 4 6 1 5
pheatmap: 4 6 9 1 5 3 2 7 8 10
When I use Euclidean distance they both give: 9 4 6 1 5 8 10 3 2 7
For maximum distance they both give: 10 7 2 6 9 4 1 5 3 8
This is actually a good question. I also played with the data a bit, and I am also at a loss as to why it gives different results. The code should lead to the same clustering, but it does not.