Hi Less,
You touch on some really important questions. The answers will depend on what works best for your dataset, so I highly recommend trying things multiple ways and seeing whether the results line up with your expectations of the data. The method used to estimate K can also give varying results, so, again, try several approaches. I recently wrote a tutorial on estimating K for RNA-seq data which you may find useful!
For these questions I'm assuming you're referring to expression data.
1) Is it mandatory to scale the dataset before? If yes, why?
I personally think this is important if you want to extract clusters based on gene profiles rather than absolute gene expression values. Without scaling, your most highly expressed genes will cluster as one group, your lowest as another, and so on. Scaling allows genes with similar profiles to cluster together regardless of their absolute expression level.
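To make the idea concrete, here is a minimal sketch (in Python with numpy, rather than the R tools discussed elsewhere in this thread; the matrix values are made up) of the row-wise Z-score scaling being described, applied gene-by-gene:

```python
import numpy as np

# Hypothetical expression matrix: 6 genes (rows) x 4 samples (columns),
# with two groups of genes at very different absolute expression levels.
rng = np.random.default_rng(0)
expr = np.vstack([
    1000 + 50 * rng.standard_normal((3, 4)),  # highly expressed genes
    10 + 0.5 * rng.standard_normal((3, 4)),   # lowly expressed genes
])

# Row-wise Z-score: each gene ends up with mean 0 and standard deviation 1,
# so clustering sees the *shape* of the profile, not its magnitude.
scaled = (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, keepdims=True)

print(np.allclose(scaled.mean(axis=1), 0))  # True
print(np.allclose(scaled.std(axis=1), 1))   # True
```

After scaling, a highly and a lowly expressed gene with the same up/down pattern across samples become near-identical rows, so they can land in the same cluster.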
2) Is it advisable to remove outliers beforehand? If yes, what's the best method to evaluate which outliers to exclude?
Outlier removal should be very carefully considered and done only for good reason. You can always run the analysis both ways and compare. Outlier samples can be identified by hierarchical clustering of the samples; outlier genes are another story. People sometimes filter out lowly expressed genes as noise. You can also filter genes a posteriori using their correlation to their cluster's mean profile.
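As an illustration of the low-expression filter mentioned above, here is a minimal Python/numpy sketch (the count values and both cutoffs are made up; tune the thresholds to your own data):

```python
import numpy as np

# Toy normalised count matrix: 3 genes (rows) x 4 samples (columns).
expr = np.array([
    [120.0, 130.0, 115.0, 140.0],   # well expressed
    [0.0,   1.0,   0.0,   2.0],     # lowly expressed -> likely noise
    [300.0, 280.0, 310.0, 290.0],   # well expressed
])

# Keep genes expressed above `threshold` in at least `min_samples` samples.
# Both cutoffs are arbitrary placeholders for this example.
threshold, min_samples = 10, 2
keep = (expr > threshold).sum(axis=1) >= min_samples
filtered = expr[keep]

print(filtered.shape[0])  # 2 genes survive the filter
```

The same pattern works for the a posteriori filter: compute each gene's Pearson correlation to its cluster's mean profile and drop genes below a chosen cutoff.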
3) I have seen that the estimated K could be used also for hierarchical clustering.
It is indeed possible to extract K clusters from a tree, which can be used as a way to cross-validate your clusters.
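For reference, cutting a hierarchical tree into a pre-estimated K looks like this in Python with scipy (the coordinates are made up; in R the equivalent would be `cutree` on an `hclust` object):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two obvious groups of samples (coordinates are invented).
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])

# Build the tree, then cut it into K clusters -- the same K you
# estimated beforehand (e.g. via the Gap Statistic).
Z = linkage(X, method="average")
K = 2
labels = fcluster(Z, t=K, criterion="maxclust")
print(labels)  # first three samples share one label, last three the other
```

If the K-cluster cut of the tree broadly agrees with your partitioning method (K-means, PAM, ...), that is reassuring cross-validation; large disagreements suggest the clusters are not robust.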
Hey Kevin, I assume that going for medoids could be preferable because they are always members of the dataset. In my case I want to cluster expressed genes based on a few samples (8). The idea is to cluster those genes in the best way possible (along with the clustering of my samples). There are a few hundred of them, so I was thinking of clustering them and then plotting a heatmap of the Pearson correlation values (or Euclidean distances?) for the resulting clusters and samples.
I was reading your paper but I'm not sure I understood the pipeline you followed. Furthermore, in factoextra (http://www.sthda.com/english/wiki/factoextra-r-package-easy-multivariate-data-analyses-and-elegant-visualization) they scale the data prior to the Gap Statistic analysis. I'm a bit confused.
Hey lessismore,
Yes, in the manuscript, the methods were buried in the supplementary. It's nothing groundbreaking, mind you, just some fancy twiddling with the medoids that PAM returns.
Yes, I can see from that tutorial that they perform scaling. I guess it depends on the implementation of the Gap Statistic that they're using. clusGap, for example, takes the log of the within-cluster dispersion with the following line:
logWks[b, k] <- log(W.k(z, k))
I'm not sure that scaling outside the function and then logging inside it is ideal, but they may be using a different implementation. Something to test out. You want to plot the Pearson correlation values of what, exactly (you could use corrplot, in that case)? Euclidean distance is also fine if your data is normally distributed, as QC'd and logged RNA-seq counts should be.
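For intuition about what clusGap is computing in that line, here is a deliberately simplified Gap Statistic sketch in Python with numpy and scikit-learn (the function name, the blob data, and the use of K-means rather than PAM are all my own choices for illustration; clusGap itself is R and handles the reference generation more carefully):

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k, n_refs=10, seed=0):
    """Simplified Gap Statistic (after Tibshirani et al.): compare
    log(W_k) on the data against the mean log(W_k) on uniform
    reference datasets drawn from the data's bounding box."""
    rng = np.random.default_rng(seed)

    def log_wk(data):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
        return np.log(km.inertia_)  # log of within-cluster dispersion

    lo, hi = X.min(axis=0), X.max(axis=0)
    refs = [log_wk(rng.uniform(lo, hi, size=X.shape)) for _ in range(n_refs)]
    return np.mean(refs) - log_wk(X)

# Two well-separated blobs of points (invented data): the gap at k=2
# should clearly exceed the gap at k=1.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(6, 0.3, (20, 2))])
print(gap_statistic(X, 2) > gap_statistic(X, 1))  # True
```

Because the log is taken of the dispersion W_k inside the statistic, pre-scaling the input changes W_k itself but is not redundant with that log step, which is the distinction being discussed above.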
I would like to test
Yes, Euclidean distance is better for log2 counts. If you wanted to cluster normalised counts (assuming a negative binomial distribution), you should use correlation distance or a Poisson-based distance metric.
In both cases the relationships between the samples will differ slightly, and you'll notice a re-arrangement of some branches between the two dendrograms. The overall 'feeling' of the clustering should not change that much, assuming that your data is normalised and QC'd.
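A quick way to check this on your own data is to cut both dendrograms at the same K and compare the memberships. A minimal Python/scipy sketch (the expression matrix is invented; with real data you would also plot the two trees side by side):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Toy logged expression matrix: 6 samples x 50 genes, two groups of
# samples with distinct (uncorrelated) expression patterns.
rng = np.random.default_rng(2)
pat_a, pat_b = rng.normal(0, 1, 50), rng.normal(0, 1, 50)
X = np.vstack([8 + pat_a + rng.normal(0, 0.2, 50) for _ in range(3)] +
              [8 + pat_b + rng.normal(0, 0.2, 50) for _ in range(3)])

def cut2(metric):
    # Hierarchical clustering under the given distance, cut into 2 groups.
    Z = linkage(pdist(X, metric=metric), method="average")
    return fcluster(Z, t=2, criterion="maxclust")

lab_euc = cut2("euclidean")    # Euclidean distance on logged values
lab_cor = cut2("correlation")  # correlation distance (1 - Pearson r)
# On this clean toy data both metrics recover the same two sample groups;
# on real data you may see minor branch re-arrangements between the trees.
```

If the two cuts agree on the broad groups but shuffle a few samples, that matches the "overall feeling unchanged, branches re-arranged" behaviour described above.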
You'll have to define 'expression profiles' for me, though.
I refer to "expression profiles" when talking about scaling the rows (genes) and thus seeing a preferred trend in expression across samples (and conditions). Could you define QCd? Thanks a lot for your answer, Kevin.
You're welcome, my friend. 'QCd' is 'quality controlled' or 'controlled for quality'.
You can do the clustering using the normalised expression values, or the scaled normalised expression values (by 'scaled', I refer to Z-scores).