Dear All,
I have a fundamental question about microarray data analysis. By default, almost every heat map function in R performs the hierarchical clustering first, then scales the rows, and then displays the image. But as you can imagine, scaling first and clustering second significantly changes both the appearance of the heat map and the clustering itself.
So my main question is: in essence, are both approaches acceptable? Or is only one of the strategies correct? If so, why not the other one?
Thanks a lot!
I see your point. Thank you very much for explaining with a figure; it makes things so much easier to understand. I used Euclidean distance, so scaling by gene reduces the distances between genes, since the expression levels of every gene will be centered around zero.
From what you have said, I gather that scaling before clustering will significantly affect the clustering. But it's not clear to me whether this difference is biologically significant.
For instance, assume there is a gene A whose expression levels range from 21 to 25 across the groups, and a gene B whose expression levels range from 11 to 15, and that their standard deviations are such that, hypothetically, both fall into the range of -1 to 1 after scaling. For these two genes, if I cluster with Euclidean distance first, they will be very far apart, whereas if I cluster after scaling, they will be much closer.
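To put numbers on this hypothetical, here is a minimal Python sketch using only the standard library. The five values per gene are an assumption made purely for illustration (the post only gives the ranges), and `zscore` and `euclidean` are helper names invented for this sketch, not functions from any heat map package.

```python
import math
import statistics

def zscore(values):
    """Center one gene's values on zero and divide by its sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def euclidean(a, b):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Assumed toy data: one value per group, matching the ranges in the post.
gene_a = [21, 22, 23, 24, 25]
gene_b = [11, 12, 13, 14, 15]

# Distance on the raw values: dominated by the 10-unit offset between the genes.
raw_distance = euclidean(gene_a, gene_b)

# Distance after row scaling: the offset is removed, only the pattern remains.
scaled_distance = euclidean(zscore(gene_a), zscore(gene_b))

print(f"raw: {raw_distance:.2f}, scaled: {scaled_distance:.2f}")
```

With these assumed values the two genes have the same pattern across groups, so the scaled distance collapses to zero while the raw distance stays large. Real genes will not match this neatly, but the direction of the effect is the same.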
So, if I understand correctly, scaling first and then clustering will help me visualize the overall trends in expression levels when comparing different groups, but I will essentially lose the distances between expression levels. But then the question again becomes: is this loss biologically relevant/significant? I mean, is knowing that gene A is expressed at 10 units and gene B at 1 unit biologically significant at all in a comparison context? To be honest, my intuition says no, since we want to compare different groups and are not interested in the base expression levels of genes. But it's not clear how the microarray analyses are done in many of the papers: how they did the scaling and, most importantly, when they did the scaling relative to the clustering, and with which distance measure.
Sorry for rambling, but for a beginner like me even such a small detail becomes frustrating, since I am not sure I can trust my intuition yet and the literature is not that clear on this point.
You are asking the right questions. A fairly common approach is Euclidean distance for clustering (we would need to talk about linkage functions here, but I'll leave that for you to read up on), followed by scaling of the genes for display only. However, that is not to say that there is a standard. In practice, try different combinations and see what you get.
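As a rough illustration of how the order of operations can change the result, the Python sketch below compares which two genes would be merged first under each order. The three genes and their values are made up for illustration, and `closest_pair` is a hypothetical helper that only finds the first agglomerative merge (the closest pair, which does not depend on the linkage function); it is not a full hierarchical clustering.

```python
import math
import statistics
from itertools import combinations

def zscore(row):
    """Row-scale one gene: subtract its mean, divide by its sample standard deviation."""
    mean = statistics.mean(row)
    sd = statistics.stdev(row)
    return [(v - mean) / sd for v in row]

def euclidean(a, b):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closest_pair(rows):
    """Return the names of the two genes with the smallest Euclidean distance."""
    return min(
        combinations(sorted(rows), 2),
        key=lambda pair: euclidean(rows[pair[0]], rows[pair[1]]),
    )

# Assumed toy data: two highly expressed genes with opposite trends,
# and one lowly expressed gene that shares the upward trend.
genes = {
    "high_up": [20, 22, 24, 26],
    "low_up": [1, 2, 3, 4],
    "high_down": [26, 24, 22, 20],
}
scaled = {name: zscore(row) for name, row in genes.items()}

# Option 1: cluster on raw values (scaled values would be used for display only).
pair_raw = closest_pair(genes)

# Option 2: scale first, then cluster on the scaled values.
pair_scaled = closest_pair(scaled)

print("first merge on raw values:   ", pair_raw)
print("first merge on scaled values:", pair_scaled)
```

On raw values the two highly expressed genes pair up because their absolute levels are similar; after scaling, the two genes with the same trend pair up instead. Which grouping is more useful depends on whether absolute expression level or pattern across groups matters for the question being asked.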