Hi,
I have to make a heatmap from fold change data. The values vary widely (for example, a few are between 500 and 1200, but the vast majority of them are below 2).
According to what I could read, I suppose I have to "scale" my heatmap but I can't understand why and how it is done. I couldn't find any clear information about it.
Can somebody explain?
Thanks
Hi francois.piumi, welcome to Biostars. Please elaborate on your question. What kind of data is this, and how did you obtain the fold-changes? Reporting a log2-FC appears reasonable given you have a large data range, but this always depends on the context. It would also help to link the sources where you read about the scaling.
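To illustrate why a log2 transform is often suggested for a wide data range, here is a minimal Python sketch. The fold-change values are hypothetical (they are not the poster's data), and the sketch assumes every fold change is a positive ratio, since log2 is undefined for zero or negative values.

```python
import math

# Hypothetical fold-change values (not from the original post):
# a few very large ones and many close to 1.
fold_changes = [1200.0, 500.0, 2.0, 1.5, 1.0, 0.5, 0.25]

# log2 makes up- and down-regulation symmetric around 0:
# FC = 2 maps to +1, FC = 0.5 maps to -1, FC = 1 (no change) maps to 0.
log2_fc = [math.log2(fc) for fc in fold_changes]

for fc, lfc in zip(fold_changes, log2_fc):
    print(f"FC = {fc:>7}  ->  log2FC = {lfc:6.2f}")

# The spread of the values shrinks from over a thousand units to about a dozen,
# so a single colour scale is no longer dominated by the largest values.
raw_range = max(fold_changes) - min(fold_changes)
log_range = max(log2_fc) - min(log2_fc)
print(f"raw range: {raw_range:.2f}, log2 range: {log_range:.2f}")
```

On the raw scale, everything below 2 lands in nearly the same colour bin, which is consistent with the "almost full lightgreen" heatmap described in this thread.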
Thank you very much for your warm welcome and for all your very kind and interesting answers.
Sorry for not having given more details about my data. They come from PCR array experiments (3 cell types, infected or not by viruses). The fold change values are an output of Qiagen's "RT2 Profiler PCR Array Data Analysis Webportal".
A simple "pheatmap" command with my values and a "greenred" color code gave me an almost full lightgreen heatmap with a few red lines. I wanted to send you an image but I couldn't find the way to do it.
Indeed, a new heatmap with log2FC-transformed values helped me to see a few more red and black lines, but the heatmap still contains a lot of lightgreen lines. So I think I will have to scale my data to get a better visualisation. But is there really a biological significance behind this scaling? How can I relate a scaled heatmap to a non-scaled one? It's not totally clear to me.
My distribution is absolutely not normal. So if I understood well, I should use correlation instead of Euclidean. Is it correct ?
In my former lab, I was also recommended to use a "Ward" linkage. What do you think about it?
My distribution is absolutely not normal. So if I understood well, I
should use correlation instead of Euclidean. Is it correct ?
As per the comments, you can use Euclidean distance. However, if you try to publish it and a reviewer is someone like me, then you would receive a yellow card for using Euclidean on a non-normal distribution - nothing major, though. For me, correlation distance with Spearman's rho coefficient would be more appropriate.
You can also use Ward's linkage. However, be sure to select 'ward.D2' and not 'ward.D' (called 'ward' in older versions of R). 'ward.D' does not implement the true Ward's linkage criterion... a sort of bug that has been identified but left in the R hclust() function.
Please do not add answers unless you're answering the top level question. This belongs as a comment-reply to Kevin's comment, which you can add using the Add Comment or Add Reply options.
I understand better now what is the purpose of the "scaling".
Nevertheless, I still don't understand its effect on my data values. For example, the gene that has the highest FC value in one condition, and so appears in red in a heatmap without scaling ('redgreen' heatmap color range), now appears in a dark red color in the heatmap with scaling. It seems that the scaling modified this value, lowering it and transforming it into a value close to 0.5 on a scale ranging from 1 to -1 (?).
And it's the same for genes that have a lower FC value: after scaling they get values close to the maximum (which is 1). On the scaled heatmap those genes appear in a very bright red color, suggesting higher gene expression, whereas their FC values are lower.
I think I don't quite understand how to interpret a "scaled" heatmap whose values do not follow the FC values, nor the meaning of the range from 1 to -1. Could you please help me to interpret them?
There is no standard protocol to draw a heatmap, because a heatmap is a visualization tool that helps you visualize what you wish to show. You need to be clear what you wish to show and the way to create a heatmap for that will surface when you search with that clarity.
Scaling a data set with center=TRUE, scale=TRUE is essentially transforming values to the Z-scale, smoothing out differences. It makes the data look good at the cost of losing information on how extreme an outlier is. But then, you're going to map colors to the scale anyway, so how extreme an outlier is will not matter much.
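The effect asked about above (a gene with a lower FC appearing brighter after scaling) can be reproduced with a small sketch. This is an illustration in Python under stated assumptions: the two genes and their values are hypothetical, and the scaling is assumed to be applied per gene (per row), which is what the scale = "row" option of pheatmap does. The helper name zscore_row is made up for this example.

```python
from statistics import mean, stdev

def zscore_row(values):
    """Center a row on its mean and divide by its sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample SD (n - 1 denominator), the same convention as R's scale()
    return [(v - m) / s for v in values]

# Hypothetical log2 fold changes for two genes across three conditions.
log2_fc = {
    "geneA": [9.0, 10.0, 9.5],  # high everywhere, little variation between conditions
    "geneB": [0.1, 0.2, 1.0],   # low everywhere, but one condition stands out
}

scaled = {gene: zscore_row(row) for gene, row in log2_fc.items()}
for gene, row in scaled.items():
    print(gene, [round(z, 2) for z in row])
```

After row scaling, each value says how far a condition sits from that gene's own mean, in units of that gene's own standard deviation. geneB's third condition therefore gets a larger Z-score than any condition of geneA, even though geneA's fold changes are far higher. A scaled heatmap compares conditions within a gene; it no longer compares magnitudes between genes.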
To follow on from ATpoint: plotting log2 fold changes is fine, or even just plotting normalised values on the negative binomial scale. There are no standards.
In the case where your data is not normally-distributed ('bell curve'), though, you should be using something like Spearman correlation distances instead of Euclidean distances, i.e., for the purposes of hierarchical clustering.
Scaling the data (to the Z-scale) just helps to 'even out' any creases that may still exist in the data, which helps for visualisation. It's strictly not a necessary procedure and there are no standards.
Also realise the distinction: we can cluster the data on one distribution (e.g. log2 expression values) for generating row and column dendrograms, and then use a different distribution (e.g. Z scores) for display in the heatmap.
Scaling the data (to the Z-scale) just helps to 'even out' any creases that may still exist in the data, which helps for visualisation. It's strictly not a necessary procedure and there are no standards.
This is what I was thinking also. It reduces the number of outliers if the expression varies wildly.
You can use hierarchical clustering, as it doesn't assume a distribution, and neither does Euclidean distance. Euclidean distance is more general than Pearson's correlation, as Pearson's does assume normality. Euclidean can even be used to cluster binary data. A z-score does assume normality; we normally do this for features when they have been transformed using log2, so maybe this is what you were thinking of? I think this may be violated a bit, but it isn't much of a problem.
It's also not clear whether you're talking about the distribution in terms of the ith feature across all samples or the jth sample across all features. This is important too. For instance, the feature distribution, assuming we have only biological replicates, should be a negative binomial when talking about CPM.
Euclidean distance works 'better' when the data are normally distributed (i.e. Gaussian). However, it's not a life or death situation to which we are referring here.
"Consequently, analytic tools that assume a Gaussian distribution
(such as classification methods based on linear discriminant analysis
and clustering methods that use Euclidean distance) may not perform as
well for sequencing data as methods that are based upon a more
appropriate distribution."
In relation to correlation distance, let us clarify so that people do not misinterpret your words:
Pearson correlation is parametric; Spearman correlation is non-parametric. If your data is not normally-distributed, use Spearman correlation distances.
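As a rough illustration of a Spearman correlation distance, here is a minimal Python sketch. It assumes the common definition distance = 1 - rho, where rho is the Pearson correlation computed on ranks. The gene profiles and the helper names (ranks, pearson, spearman_distance) are hypothetical and chosen only for this example.

```python
from statistics import mean

def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for a constant profile."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman_distance(x, y):
    """1 - Spearman's rho: 0 for identical rank order, 2 for fully reversed order."""
    return 1 - pearson(ranks(x), ranks(y))

# Hypothetical profiles across four conditions; gene_a contains one extreme value.
gene_a = [1.2, 1.5, 2.0, 1200.0]
gene_b = [0.1, 0.3, 0.4, 0.9]
gene_c = [5.0, 3.0, 2.0, 1.0]

print("Spearman distance a-b:", spearman_distance(gene_a, gene_b))
print("Spearman distance a-c:", spearman_distance(gene_a, gene_c))
print("Pearson correlation a-b on raw values:", round(pearson(gene_a, gene_b), 3))
```

Because only the rank order matters, the single extreme value in gene_a has no more influence than any other point: gene_a and gene_b rise together and get a distance of 0, while gene_c falls as gene_a rises and gets the maximum distance. The Pearson correlation on the raw values of gene_a and gene_b, by contrast, is pulled below 1 by the outlier.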
To rephrase my question: I don't understand why we have to "scale" the data. Is a log2 transformation not enough?
Here are my main sources:
http://www.opiniomics.org/you-probably-dont-understand-heatmaps/
https://github.com/slowkow/slowkow.com/blob/master/_rmd/2017-02-16-heatmap-tutorial.R
https://bioramble.wordpress.com/2015/07/30/heatmaps-part-2-how-to-create-a-heatmap-with-r-in-an-ideal-world/
Thanks for your answers.
Francois
Thank you very much Kevin!
Sorry to ask my question again, but could someone help me to understand data scaling? Why couldn't I find a standard protocol to draw a heatmap?
Please read Kevin's comment here: C: Heatmap : why scaling ?