Differences Between Eisen Hierarchical Clustering In Bioconductor/R Versus Heatmap.2?
1
4
Entering edit mode
11.3 years ago
user ▴ 950

I'm comparing two ways of creating heatmaps with dendrograms in R, one with made4's heatplot (from bioconductor) and one with gplots of heatmap.2. The appropriate results depend on the analysis but I'm trying to understand why the defaults are so different, and how to get both functions to give the same result (or highly similar result) so that I understand all the 'blackbox' parameters that go into this.

This is the example data and packages:

require(gplots)
# made4 from bioconductor
require(made4)
data(khan)
data <- as.matrix(khan$train[1:30,])

Clustering the data with heatmap.2 gives:

heatmap.2(data, trace="none")

enter image description here

Using heatplot gives:

heatplot(data)

enter image description here

very different results and scalings initially. heatplot results look more reasonable in this case so I'd like to understand what parameters to feed into heatmap.2 to get it to do the same, since heatmap.2 has other advantages/features I'd like to use and because I want to understand the missing ingredients.

heatplot uses average linkage with correlation distance so we can feed that into heatmap.2 to ensure similar clusterings are used (based on https://stat.ethz.ch/pipermail/bioconductor/2010-August/034757.html)

dist.pear <- function(x) as.dist(1-cor(t(x)))
hclust.ave <- function(x) hclust(x, method="average")
heatmap.2(data, trace="none", distfun=dist.pear, hclustfun=hclust.ave)

resulting in:

enter image description here

this makes the row-side dendrograms look more similar but the columns are still different and so are the scales. It appears that heatplot scales the columns somehow by default that heatmap.2 doesn't do that by default. If I add a row-scaling to heatmap.2, I get:

heatmap.2(data, trace="none", distfun=dist.pear, hclustfun=hclust.ave,scale="row")

enter image description here

which still isn't identical but is closer. How can I reproduce heatplot's results with heatmap.2? What are the differences?

edit2: it seems like a key difference is that heatplot rescales the data with both rows and columns, using:

if (dualScale) {
    print(paste("Data (original) range: ", round(range(data), 
        2)[1], round(range(data), 2)[2]), sep = "")
    data <- t(scale(t(data)))
    print(paste("Data (scale) range: ", round(range(data), 
        2)[1], round(range(data), 2)[2]), sep = "")
    data <- pmin(pmax(data, zlim[1]), zlim[2])
    print(paste("Data scaled to range: ", round(range(data), 
        2)[1], round(range(data), 2)[2]), sep = "")
}

This is what I'm trying to import to my call to heatmap.2. The reason I like it is because it makes the contrasts larger between the low and high values, whereas just passing zlim to heatmap.2 gets simply ignored. How can I use this 'dual scaling' while preserving the clustering along the columns? All I want is the increased contrast you get with:

heatplot(..., dualScale=TRUE, scale="none")

compared with the low contrast you get with:

heatplot(..., dualScale=FALSE, scale="row")

Any ideas on this?

clustering r bioconductor gene-expression • 15k views
ADD COMMENT
4
Entering edit mode
11.3 years ago

I think you've identified the biggest differences between the default behaviors: first and foremost it's this "dual scaling" business, and secondly there are slight differences in the default clustering behavior.

To reproduce the diagram using heatmap.2 alone:

library(gplots)
library(RColorBrewer)

data(khan, package="made4")
data <- as.matrix(khan$train[1:30,])

cols <- rev(colorRampPalette(brewer.pal(10, "RdBu"))(256))

## Massaging the data
d <- t(scale(t(data)))         ## "dual scaling"
d <- pmin(pmax(d, -3), 3)      ## Compressing data to max/min +/- 3

## Setup the dendrograms
colv <- as.dendrogram(hclust(as.dist(1 - cor(d)), method="ave"))
rowv <- as.dendrogram(hclust(as.dist(1 - cor(t(d))), method="ave"))

heatmap.2(d, trace="none", Rowv=rowv, Colv=colv, scale="none", col=cols, density.info="none")

This dual scaling maneuver is ... interesting. I typically take a different tack -- whether or not I scale (here, the dual scaling forces you to scale the rows), I perform the clustering on the raw (or scaled w/o compressed) data, and then create a the color map using a compressed version of the data so that the colors aren't washed out in the heatmap, but the clustering pattern is unaffected by the data compression.

I also think that the aheatmap function in NMF package creates nice (annotated) heatmaps.

ADD COMMENT
0
Entering edit mode

extremely helpful thanks. still confused about a few things: isn't scaling by rows only exactly what you want for the heatmap (so that genes across samples are comparable), so scale="row" but without scaling the columns - i.e. without t(scale(t(data)))? i'd prefer to avoid any scaling (except rows which seems reasonable) if it's not needed. Clipping the scales seems legitimate for visualization if it doesn't distort the result. Second, why do you need to manually set the column and row dendrograms here as opposed to just letting heatmap.2 do it? It seems that heatmap.2 and heatplot yield different column dendrograms. for example EWS.T14 and EWS.T1 are typically close to EWS.6 in heatmap.2 but far apart in heatplot. Why is that?

ADD REPLY
1
Entering edit mode

A few quick comments to hopefully answer your Qs:

  • There is no default answer to "should I scale the rows or the columns" .. if you have enough samples (columns), I can imagine a scenario where scaling the rows would be beneficial (so you see relative changes of the gene across samples) -- for that to make the most sense, you might want some type of reassurance that the expression distribution across your samples look similar (are made to be similar)
  • t(scale(t(data))) doesn't scale the columns, it scales the rows.
  • I'm manually setting the Rowv and Colv dendrograms here, but you could have rather just passed in the appropriate hclustfun and distfun functions parameterized correctly.
  • With the code I use here, I'm not getting any differences in the ordering of the heatmaps between heatmap.2 and the original heatplot call, so I don't know what you are referring to re: the relative positinos of EWS.T14, EWS.T1, etc.

Note that you probably want to set symkey=TRUE and symbreaks=TRUE so your colors are set to be white at 0 -- notice how your "white" color is somewhere between 0 and 1.

ADD REPLY

Login before adding your answer.

Traffic: 1873 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6