Question

Clustering Data (Rna-Seq) Using R To Produce A Heatmap

19

13.0 years ago

Kanne ▴ 450

I have RNA-seq data (FPKMs) from Cufflinks and would like to cluster it by gene and produce a heatmap.

This is my first try at using R and I have spent a LOT of time poring over the manual/help pages and internet tutorials on how to do this.

I can now produce heatmaps using "heatmap" easily enough. My problem is that I can produce them from many different versions/transformations of my data, and I cannot figure out what is going on or which heatmap is the analysis I am interested in.

What I am trying to get is a) gene names clustered by expression profile, to mine for enriched gene groups/pathways; and b) a heatmap of FPKM values, with the same gene clustering.

This is the R code: Data input/preparation

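 # read.table() already returns a data frame, so no data.frame() wrapper is needed
 m <- read.table("DMSTSC1000_notmeanctrd.txt", header=TRUE, sep="\t")
 row.names(m) <- m$test_id    # use gene IDs as row names
 m <- m[,2:7]                 # keep only the six sample columns
 m_matrix <- data.matrix(m)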

Making Heatmap version 1:

heatmap(m_matrix, Colv=NA, scale="column")

Making Heatmap version 2. This came about because a paper described using a Pearson correlation metric with clustering, but this heatmap looks terrible: the clustering appears to bear little relationship to the imaged data.

cor_t <- cor(t(m_matrix))
distancet <- as.dist(cor_t)
hclust_complete <- hclust(distancet, method = "complete")
dendcomplete <- as.dendrogram(hclust_complete)
heatmap(m_matrix, Rowv=dendcomplete, Colv=NA, scale="column")

Making Heatmap version 3

distancem <- dist(m_matrix)
hclust_completem <- hclust(distancem, method = "complete")
dendcompletem <- as.dendrogram(hclust_completem)
heatmap(m_matrix, Rowv=dendcompletem, Colv=NA, scale="column")

Or, if you have code for a fourth way that you're confident about, I'd love to hear it! I tried to use pam but haven't been able to produce a heatmap from it yet.

Sorry about not uploading images, I haven't figured out how to web-host them yet.

Details: FPKM data has been log2 transformed and high outliers were capped at a maximum value (10), to increase the range of colors used for the majority of the data.
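For readers who want to reproduce this kind of preprocessing, a minimal sketch follows. The FPKM values are invented, and the pseudocount of 1 is an assumption (the post does not say how zeros were handled); only the log2 transform and the cap at 10 come from the description above.

```r
# Invented FPKM values: 2 genes (rows) x 2 samples (columns)
fpkm <- matrix(c(0, 3, 1023, 4095), nrow = 2,
               dimnames = list(c("gene1", "gene2"), c("s1", "s2")))

log_fpkm <- log2(fpkm + 1)    # pseudocount of 1 is an assumption; avoids log2(0)
capped <- pmin(log_fpkm, 10)  # cap high outliers at a maximum value of 10
```

Capping changes the values that go into any distance calculation, so it may be worth clustering on the uncapped matrix and using the capped one only for the colors.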

Thank you in advance for your help, it is very much appreciated!!

r rna heatmap clustering gene • 64k views

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 13.0 years ago by Kanne ▴ 450

0

Maybe you should change the title since, from what I understood, it seems your problem is more about choosing clustering methods than generating and analyzing heatmaps which you seem to know how to do.

ADD REPLY • link 13.0 years ago by Philippe ★ 1.9k

0

True, will do, thanks!

ADD REPLY • link 13.0 years ago by Kanne ▴ 450

score 18 · Answer 1 · 2011-11-11

To shorten your search: there is no correct answer and no best method for choosing distance measures in cluster analysis; if there were, everybody would be using it. In data mining there are a gazillion methods, each with different characteristics that make different aspects of the data visible. The idea is not to rely on a single best method, but to try several that help you generate new hypotheses about the data.

That said, there is one important requirement for distance measures that your use of correlation as a distance does not meet. I'd phrase it like this: similar objects have d close to 0, dissimilar objects have d > 0, and the more dissimilar they are, the larger d. Correlation, however, ranges over -1 <= r <= 1 and behaves the opposite way: the most similar profiles get the largest value (r = 1). There are several ways, each with different characteristics, to turn correlation into a distance:

correlation distance: d := 1 - r (anti-correlation: d = 2; no correlation: d = 1; full correlation: d = 0)
absolute correlation distance: d := 1 - |r| (edit: d := |1 - r| was a little mistake, because the result is identical to the first distance)
r-squared distance: d := 1 - r^2 (no correlation: d = 1; anti- and full correlation: d = 0)

This explains why your attempt using correlation distance didn't work out. Therefore try the following R-code, and see if it improves things:

cor_t <- 1 - cor(t(m_matrix))        # correlation distance, or
cor_t <- 1 - abs(cor(t(m_matrix)))   # absolute correlation distance (edited), or
cor_t <- 1 - cor(t(m_matrix))^2      # r-squared distance

These are still not true distance metrics, because they can violate the triangle inequality, but they are still usable for clustering.
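As a minimal sketch of how one of these distances plugs into the clustering code from the question: the matrix `toy` below is invented and stands in for `m_matrix`, and the choice of k = 2 for `cutree` is arbitrary. It also shows one way to get the gene names in clustered order alongside a heatmap that uses the same dendrogram.

```r
# Invented stand-in for m_matrix: 4 genes (rows) x 6 samples (columns)
toy <- rbind(
  geneA = c(1, 2, 3, 4, 5, 6),
  geneB = c(2, 4, 6, 8, 10, 12),  # perfectly correlated with geneA
  geneC = c(6, 5, 4, 3, 2, 1),    # perfectly anti-correlated with geneA
  geneD = c(5, 1, 4, 2, 6, 3)     # only weakly correlated with geneA
)

r <- cor(t(toy))               # gene-by-gene Pearson correlation
d_cor <- as.dist(1 - r)        # correlation distance: 0 (same shape) to 2 (opposite)
d_abs <- as.dist(1 - abs(r))   # absolute correlation distance: 0 to 1

hc <- hclust(d_cor, method = "complete")
groups <- cutree(hc, k = 2)                 # named vector: gene name -> cluster id
clustered_genes <- rownames(toy)[hc$order]  # gene names in dendrogram order

# Same dendrogram drives the heatmap, so rows match the clustering above
heatmap(toy, Rowv = as.dendrogram(hc), Colv = NA, scale = "column")
```

With `d_abs` in place of `d_cor`, geneC sits at distance 0 from geneA and geneB, so anti-correlated profiles end up in the same cluster. Which behavior is preferable depends on the question being asked of the data.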

score 2 · Answer 2 · 2013-11-24

You might also want to try out heatmap.2 from the gplots package:

http://cran.r-project.org/web/packages/gplots/index.html

http://mannheimiagoesprogramming.blogspot.com/2012/06/drawing-heatmaps-in-r-with-heatmap2.html

I think it has a little more functionality that is useful for gene expression visualization.
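A sketch of what such a call could look like, assuming the gplots package is installed. The matrix `toy` is invented and stands in for `m_matrix`, and `cor_dist` is a hypothetical helper name; `distfun`, `hclustfun`, `dendrogram`, `scale` and `trace` are arguments of `heatmap.2`.

```r
# Invented stand-in for m_matrix: 5 genes x 6 samples
set.seed(1)
toy <- matrix(rnorm(30), nrow = 5,
              dimnames = list(paste0("gene", 1:5), paste0("sample", 1:6)))

# Hypothetical helper: 1 - Pearson correlation between rows, as a dist object
cor_dist <- function(x) as.dist(1 - cor(t(x)))

# Only draw the plot if gplots is available
if (requireNamespace("gplots", quietly = TRUE)) {
  gplots::heatmap.2(toy,
                    distfun = cor_dist,   # distance used for the row dendrogram
                    hclustfun = function(d) hclust(d, method = "complete"),
                    Colv = FALSE, dendrogram = "row",
                    scale = "column", trace = "none")
}
```

Passing the distance and linkage as functions keeps the clustering choice visible in the plotting call, which makes it easier to swap in another distance and compare the results.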

Ram · Answer 3 · 2015-04-08

"I can now produce heatmaps using "heatmap" easily enough, my problem is that I can produce them from many different versions/transformations of my data and I cannot figure out what is going on and which heatmap is the analysis I am interested in."

HeatmapGenerator has a database storage system which stores any heatmap you have ever produced along with its corresponding name so that you can always refer back to a heatmap you made in the past from a central repository. Source: http://sourceforge.net/projects/heatmapgenerator/