I've made a t-SNE plot of my data. I can show it here, but unfortunately I can't show you the labels. There are 4 different labels.
The plot was created from a dataframe called "scores", which contains approximately 1100 patient samples (rows) and 25 features (columns). The labels for the plot come from a separate dataframe called "metadata". The following code was used to generate the plot, using information from both the "scores" and "metadata" dataframes:
library(Rtsne)
library(ggplot2)

# Run t-SNE on the numeric "scores" data and reduce to 2 dimensions
tsneres <- Rtsne(scores, dims = 2, perplexity = 6)
tsneres$Y <- as.data.frame(tsneres$Y)

# Colour each point by the corresponding label from "metadata"
ggplot(tsneres$Y, aes(x = V1, y = V2, color = metadata$labels)) +
  geom_point()
My mission:
I want to analyze the t-SNE plot and identify which features (i.e. which columns of the "scores" matrix) are most prevalent in each cluster. Specifically, I want to understand which features are most helpful in distinguishing between the different clusters present in the plot. Is it possible to use an alternative algorithm, such as PCA, that preserves the distances between data points to accomplish this task? Perhaps it's even a better choice than t-SNE?
This is an example of "scores" (not the real data, but it's similar):
structure(list(Feature1 = c(0.1, 0.3, -0.2, -0.12, 0.17, -0.4,
-0.21, -0.19, -0.69, 0.69), Feature2 = c(0.22, 0.42, 0.1, -0.83,
0.75, -0.34, -0.25, -0.78, -0.68, 0.55), Feature3 = c(0.73, -0.2,
0.8, -0.48, 0.56, -0.21, -0.26, -0.78, -0.67, 0.4), Feature4 = c(0.34,
0.5, 0.9, -0.27, 0.64, -0.11, -0.41, -0.82, -0.4, -0.23), Feature5 = c(0.45,
0.33, 0.9, 0.73, 0.65, -0.1, -0.28, -0.78, -0.633, 0.32)), class = "data.frame", row.names = c("Patient_A",
"Patient_B", "Patient_C", "Patient_D", "Patient_E", "Patient_F",
"Patient_G", "Patient_H", "Patient_I", "Patient_J"))
@Mensur Dlakic Hi, thanks for your reply! So in the Towards Data Science example, they used k-means clustering and then explained the contribution of features using SHAP. I'm not sure if k-means is suitable for my goal. K-means and t-SNE really are similar, so I wonder if it could work.
What I need to do is just visualize the data as it is, and then explain the contributions of the features in each cluster, i.e. which features are most prominent in each cluster.
Will k-means be a suitable option for this aim, in your opinion? And is it still practicable via t-SNE?
To the best of my knowledge, t-SNE is not explainable.
K-means or kNN may or may not delineate the groups as well as the t-SNE plot does, but either will give you some kind of clustering solution. Your dataset is tiny, so it should take only seconds per run to get many clustering solutions, which can be plotted and whose feature contributions can be explained. If you want to visualize the data AND be able to explain the features, I think you may need to compromise and go with a clustering solution that doesn't look as clear-cut on paper. Even your t-SNE plot has at least two colors in most groups.
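For example, something along these lines (a minimal sketch, not a definitive recipe: centers = 4 is only an assumption to match your four labels, and looking at per-cluster means of the scaled features is just one simple alternative to the SHAP approach for describing which features set each cluster apart):

library(ggplot2)

set.seed(1)

# Cluster the samples on the features alone; no labels are used here
km <- kmeans(scale(scores), centers = 4, nstart = 25)

# Mean of each scaled feature within each cluster:
# features with large absolute means in a cluster are the ones that distinguish it
aggregate(as.data.frame(scale(scores)), by = list(cluster = km$cluster), FUN = mean)

# Overlay the cluster assignment on the existing t-SNE coordinates
ggplot(tsneres$Y, aes(x = V1, y = V2, color = factor(km$cluster))) +
  geom_point()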
I tried the examples you suggested, but it's not what I need. K-means did cluster my data, but not according to the labels; it clusters according to the data itself without considering the labels, so it's not good for me :(
I already told you that the t-SNE solution may be more visually appealing.
t-SNE also separates the data without considering the labels. If you had included class labels in the data used with t-SNE, that would have biased the embedding. Class labels, whether real or assumed, should never be part of the clustering process unless one is doing supervised clustering and explicitly discloses that fact.
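If the question is how well an unsupervised clustering happens to line up with your known labels, a simple cross-tabulation would show it (a sketch, reusing the hypothetical km object from the k-means example above; disagreement here is expected, since the labels play no part in the clustering):

# Cross-tabulate unsupervised cluster assignments against the known labels
table(cluster = km$cluster, label = metadata$labels)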