Hi Less,
You touch on some really important questions. The answers will depend on what works best for your dataset, so I highly recommend trying things multiple ways and seeing whether the results line up with your expectations of the data. The method used to estimate K can also give varying results, so, again, try several approaches. I recently wrote a tutorial on estimating K for RNA-seq data which you may find useful!
For these questions I'm assuming you're referring to expression data.
1) Is it mandatory to scale the dataset before? If yes, why?
I personally think this is important if you want to extract clusters based on gene profiles rather than absolute gene expression values. Without scaling, your most highly expressed genes will cluster as one group, your lowest as another, and so on. Scaling allows genes with similar profiles to cluster together regardless of their absolute expression level.
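To make the idea concrete, here is a minimal sketch (in Python with numpy, rather than the R tools discussed elsewhere in this thread; the matrix values are made up) of the row-wise Z-score scaling being described, applied gene-by-gene:

```python
import numpy as np

# Hypothetical expression matrix: 6 genes (rows) x 4 samples (columns),
# with two groups of genes at very different absolute expression levels.
rng = np.random.default_rng(0)
expr = np.vstack([
    1000 + 50 * rng.standard_normal((3, 4)),  # highly expressed genes
    10 + 0.5 * rng.standard_normal((3, 4)),   # lowly expressed genes
])

# Row-wise Z-score: each gene ends up with mean 0 and standard deviation 1,
# so clustering sees the *shape* of the profile, not its magnitude.
scaled = (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, keepdims=True)

print(np.allclose(scaled.mean(axis=1), 0))  # True
print(np.allclose(scaled.std(axis=1), 1))   # True
```

After scaling, a highly and a lowly expressed gene with the same up/down pattern across samples become near-identical rows, so they can land in the same cluster.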
2) Is it advisable to remove outliers beforehand? If yes, what's the best method to evaluate which outliers to exclude?
Outlier removal should be very carefully considered and done only for good reason. You can always run the analysis both ways and compare. Outlier samples can be identified by hierarchical clustering of the samples; outlier genes are another story. People sometimes filter out lowly expressed genes as noise. You can also filter genes a posteriori using their correlation to their cluster's mean profile.
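As an illustration of the low-expression filter mentioned above, here is a minimal Python/numpy sketch (the count values and both cutoffs are made up; tune the thresholds to your own data):

```python
import numpy as np

# Toy normalised count matrix: 3 genes (rows) x 4 samples (columns).
expr = np.array([
    [120.0, 130.0, 115.0, 140.0],   # well expressed
    [0.0,   1.0,   0.0,   2.0],     # lowly expressed -> likely noise
    [300.0, 280.0, 310.0, 290.0],   # well expressed
])

# Keep genes expressed above `threshold` in at least `min_samples` samples.
# Both cutoffs are arbitrary placeholders for this example.
threshold, min_samples = 10, 2
keep = (expr > threshold).sum(axis=1) >= min_samples
filtered = expr[keep]

print(filtered.shape[0])  # 2 genes survive the filter
```

The same pattern works for the a posteriori filter: compute each gene's Pearson correlation to its cluster's mean profile and drop genes below a chosen cutoff.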
3) I have seen that the estimated K could be used also for hierarchical clustering.
It is indeed possible to extract K clusters from a tree, which can be used as a way to cross-validate your clusters.
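For reference, cutting a hierarchical tree into a pre-estimated K looks like this in Python with scipy (the coordinates are made up; in R the equivalent would be `cutree` on an `hclust` object):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two obvious groups of samples (coordinates are invented).
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])

# Build the tree, then cut it into K clusters -- the same K you
# estimated beforehand (e.g. via the Gap Statistic).
Z = linkage(X, method="average")
K = 2
labels = fcluster(Z, t=K, criterion="maxclust")
print(labels)  # first three samples share one label, last three the other
```

If the K-cluster cut of the tree broadly agrees with your partitioning method (K-means, PAM, ...), that is reassuring cross-validation; large disagreements suggest the clusters are not robust.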
Hey Kevin, I assume that going for medoids could be preferable because they are always members of the dataset. In my case I want to cluster expressed genes based on a few samples (8). The idea is to cluster those genes in the best way possible (along with the clustering of my samples). There are a few hundred of them, so I was thinking of clustering them and then plotting a heatmap of the Pearson correlation values (or Euclidean distances?) for the resulting clusters and samples.
I was reading your paper but I'm not sure I understood the pipeline you followed. Furthermore, in factoextra (http://www.sthda.com/english/wiki/factoextra-r-package-easy-multivariate-data-analyses-and-elegant-visualization) they scale the data prior to the Gap Statistic analysis. I'm a bit confused.
Hey lessismore,
Yes, in the manuscript, the methods were buried in the supplementary. It's nothing groundbreaking, mind you, just some fancy twiddling with the medoids that PAM returns.
Yes, I can see from that tutorial that they perform scaling. I guess it depends on the implementation of the Gap Statistic that they're using. clusGap, for example, takes the log of the within-cluster dispersion with the following line:
logWks[b, k] <- log(W.k(z, k))
I'm not sure that scaling outside the function and then logging inside it is ideal, but they may be using a different implementation. Something to test out. You want to plot the Pearson correlation values of what, exactly (you could use corrplot, in that case)? Euclidean distance is also fine if your data is normally distributed, as QC'd and logged RNA-seq counts should be.
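For intuition about what clusGap is computing in that line, here is a deliberately simplified Gap Statistic sketch in Python with numpy and scikit-learn (the function name, the blob data, and the use of K-means rather than PAM are all my own choices for illustration; clusGap itself is R and handles the reference generation more carefully):

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k, n_refs=10, seed=0):
    """Simplified Gap Statistic (after Tibshirani et al.): compare
    log(W_k) on the data against the mean log(W_k) on uniform
    reference datasets drawn from the data's bounding box."""
    rng = np.random.default_rng(seed)

    def log_wk(data):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
        return np.log(km.inertia_)  # log of within-cluster dispersion

    lo, hi = X.min(axis=0), X.max(axis=0)
    refs = [log_wk(rng.uniform(lo, hi, size=X.shape)) for _ in range(n_refs)]
    return np.mean(refs) - log_wk(X)

# Two well-separated blobs of points (invented data): the gap at k=2
# should clearly exceed the gap at k=1.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(6, 0.3, (20, 2))])
print(gap_statistic(X, 2) > gap_statistic(X, 1))  # True
```

Because the log is taken of the dispersion W_k inside the statistic, pre-scaling the input changes W_k itself but is not redundant with that log step, which is the distinction being discussed above.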
I would like to test
Yes, Euclidean distance is better for log2 counts. If you wanted to cluster normalised counts (assuming a negative binomial distribution), you should use correlation distance or a Poisson-based distance metric.
In both cases the relationships between the samples will differ slightly, and you'll notice a re-arrangement of some branches between the two dendrograms. The overall 'feeling' of the clustering should not change that much, assuming that your data is normalised and QC'd.
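A quick way to check this on your own data is to cut both dendrograms at the same K and compare the memberships. A minimal Python/scipy sketch (the expression matrix is invented; with real data you would also plot the two trees side by side):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Toy logged expression matrix: 6 samples x 50 genes, two groups of
# samples with distinct (uncorrelated) expression patterns.
rng = np.random.default_rng(2)
pat_a, pat_b = rng.normal(0, 1, 50), rng.normal(0, 1, 50)
X = np.vstack([8 + pat_a + rng.normal(0, 0.2, 50) for _ in range(3)] +
              [8 + pat_b + rng.normal(0, 0.2, 50) for _ in range(3)])

def cut2(metric):
    # Hierarchical clustering under the given distance, cut into 2 groups.
    Z = linkage(pdist(X, metric=metric), method="average")
    return fcluster(Z, t=2, criterion="maxclust")

lab_euc = cut2("euclidean")    # Euclidean distance on logged values
lab_cor = cut2("correlation")  # correlation distance (1 - Pearson r)
# On this clean toy data both metrics recover the same two sample groups;
# on real data you may see minor branch re-arrangements between the trees.
```

If the two cuts agree on the broad groups but shuffle a few samples, that matches the "overall feeling unchanged, branches re-arranged" behaviour described above.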
You'll have to define 'expression profiles' for me, though.
I refer to "expression profiles" when talking about scaling the rows (genes) and thus seeing a preferred trend in expression across samples (and conditions). Could you define QCd? Thanks a lot for your answer, Kevin.
You're welcome, my friend. 'QCd' is 'quality controlled' or 'controlled for quality'.
You can do the clustering using the normalised expression values, or the scaled normalised expression values (by 'scaled', I refer to Z-scores).