Check what are the columns that are enriched in each t-SNE cluster
22 months ago
JACKY ▴ 160

I've made a t-SNE plot of my data. I can show it here, but unfortunately I can't show you the labels. There are 4 different labels:

[t-SNE plot of the samples, colored by the four labels]

The plot was created from a dataframe called scores, which contains approximately 1100 patient samples and 25 feature columns. The labels for the plot come from a separate dataframe called metadata. The following code generates the plot from both dataframes:

library(Rtsne)
library(ggplot2)

set.seed(42)  # t-SNE is stochastic; fix a seed for reproducibility
tsneres <- Rtsne(scores, dims = 2, perplexity = 6)
tsneres$Y <- as.data.frame(tsneres$Y)
ggplot(tsneres$Y, aes(x = V1, y = V2, color = metadata$labels)) + 
  geom_point()

My mission:

I want to analyze the t-SNE plot and identify which features (columns of the scores matrix) are most prevalent in each cluster. Specifically, I want to understand which features are most helpful in distinguishing between the different clusters in the plot. Is it possible to use an alternative algorithm, such as PCA, that preserves the distances between data points in order to accomplish this task? Perhaps it's even a better choice than t-SNE?

This is an example of scores, this is not the real data, but it's similar:

structure(list(Feature1 = c(0.1, 0.3, -0.2, -0.12, 0.17, -0.4, 
-0.21, -0.19, -0.69, 0.69), Feature2 = c(0.22, 0.42, 0.1, -0.83, 
0.75, -0.34, -0.25, -0.78, -0.68, 0.55), Feature3 = c(0.73, -0.2, 
0.8, -0.48, 0.56, -0.21, -0.26, -0.78, -0.67, 0.4), Feature4 = c(0.34, 
0.5, 0.9, -0.27, 0.64, -0.11, -0.41, -0.82, -0.4, -0.23), Feature5 = c(0.45, 
0.33, 0.9, 0.73, 0.65, -0.1, -0.28, -0.78, -0.633, 0.32)), class = "data.frame", row.names = c("Patient_A", 
"Patient_B", "Patient_C", "Patient_D", "Patient_E", "Patient_F", 
"Patient_G", "Patient_H", "Patient_I", "Patient_J"))
r Dimensionality-reduction MachineLearning Rtsne
22 months ago
Mensur Dlakic ★ 28k

t-SNE may not be the best choice if you want to understand feature contributions to the observed groupings of data points. I think PCA is more likely to be useful here, although its principal components are not directly equivalent to individual features.
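For example, here is a quick sketch of the PCA route with base R's prcomp, assuming scores and metadata are the data frames from your question. The rotation matrix (the loadings) shows how strongly each original feature contributes to each principal component, which is the closest PCA gets to per-feature explanations:

```r
# PCA on centered and scaled features (sketch; adapt to your real data)
pca <- prcomp(scores, center = TRUE, scale. = TRUE)

# Loadings: contribution of each original feature to the first two PCs
round(pca$rotation[, 1:2], 2)

# Samples in PC space, colored by the external labels
plot(pca$x[, 1], pca$x[, 2], col = factor(metadata$labels),
     xlab = "PC1", ylab = "PC2")
```

Features with large absolute loadings on a component that separates your groups are the ones driving that separation.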

In general, you should be able to use any clustering method and try to make it explainable, as in the example below:

https://towardsdatascience.com/how-to-make-clustering-explainable-1582390476cc

In fact, having an explanation for the clusters may help in subsequently building a supervised clustering on top of them:

https://www.aidancooper.co.uk/supervised-clustering-shap-values/
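The general pattern from that second link, sketched in R with some hypothetical choices (4 clusters to match your labels, and the randomForest package standing in as the supervised model; computing actual SHAP values would need an additional package):

```r
library(randomForest)

set.seed(1)
km <- kmeans(scale(scores), centers = 4, nstart = 25)  # centers = 4 is an assumption

# Treat the cluster assignments as a target and ask which features predict them
rf <- randomForest(x = scores, y = factor(km$cluster), importance = TRUE)
importance(rf)  # higher importance = more useful for separating the clusters
```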


Mensur Dlakic Hi, thanks for your reply! So in the towardsdatascience example, they used k-means clustering and then explained the contribution of features using SHAP. I'm not sure if k-means is suitable for my goal. k-means and t-SNE really are similar, so I wonder if it could work.

What I need to do is just visualize the data as it is, then explain the contributions of features in each cluster, i.e. which features are most prominent in each cluster.

Will k-means be a suitable option for this aim, in your opinion? And is it still practicable via t-SNE?


To the best of my knowledge, t-SNE is not explainable.

K-means or kNN may or may not delineate the groups as well as the t-SNE plot does, but either will give you some kind of clustering solution. Your dataset is tiny, so it should take only seconds per run to obtain many clustering solutions, which can be plotted and their feature contributions explained. If you want to visualize the data AND be able to explain the features, I think you may need to compromise and go with a clustering solution that doesn't look as clear-cut on paper. Even your t-SNE plot has at least two colors in most groups.
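A minimal sketch of that workflow in R, assuming 4 clusters to mirror your 4 labels. Per-cluster feature means are a crude but quick first look at which features stand out in each cluster:

```r
set.seed(1)  # k-means depends on random initialization
km <- kmeans(scale(scores), centers = 4, nstart = 25)

# Mean of every feature within each cluster: features with extreme
# per-cluster means are the ones that characterize that cluster
aggregate(scores, by = list(cluster = km$cluster), FUN = mean)
```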


I tried the examples you suggested; it's not what I need. The k-means did cluster my data, but not according to the labels: it clusters according to the data itself without considering the labels, so it's not good for me :(


I already told you that the t-SNE solution may be more visually appealing.

t-SNE also separates the data without considering the labels. If you had class labels in the data used with t-SNE, that would bias the embedding. Class labels, whether real or assumed, should never be part of the clustering process unless one is doing supervised clustering and explicitly discloses that fact.
