Question

Iterative clustering with silhouette score

5 months ago

sleepystudent • 0

Hi all!

I am analyzing a single-cell dataset consisting of one cell line in 4 conditions (control, treatment1, treatment2, combination). I integrated all 4 samples using the Seurat package and want to cluster the integrated data to assess how cluster percentages change between samples.

However, I don't know which resolution to choose. When clustering at different resolutions, I observed that some clusters share the same function (based on enrichment analysis), while others apparently combine several functions (e.g., mitotic division, oxphos and immune response). Depending on the resolution, the size of clusters with the same function (according to enrichment) could either increase or decrease when comparing treatment vs control. As I already have expectations about what happens under the treatments (based on bulk-seq, proteomics and PCR data), choosing the optimal resolution based on functional analysis seemed biased to me. Furthermore, as this is a single cell line rather than a mixed cell population, I don't expect to see specific marker-defined subpopulations, as one would in a typical single-cell experiment.

Then I tried analysing the mean silhouette score at different resolutions, but it was about 0.25 (weak clustering), and as the resolution increased, many clusters had negative silhouette values.
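For reference, this is the standard definition of the silhouette width of a single cell $i$ (a general definition, not something specific to Seurat). Here $a(i)$ is the mean distance from cell $i$ to the other cells in its own cluster, and $b(i)$ is the smallest mean distance from cell $i$ to the cells of any other cluster:

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
```

A negative value therefore means $a(i) > b(i)$: on average, the cell is closer to some other cluster than to the one it was assigned to.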

So I decided to implement an iterative clustering according to the following scheme. I calculate the mean silhouette score of each cluster at different resolutions, choose the cluster with the highest score, assign it as a new final cluster, and repeat the procedure on the remaining cells without it. I stop clustering when all of the subclusters have a silhouette score under 0.5 (normal clustering).

As I haven't seen this approach anywhere, I'm wondering whether it is mathematically and biologically sound.

Looking forward to your suggestions/replies! Thanks!

clustering resolution silhouette single-cell • 607 views

updated 5 months ago by Mensur Dlakic ★ 29k • written 5 months ago by sleepystudent • 0

score 0 · Answer 1 · 2024-11-22

You didn't tell us what type of clustering is used here. Specifically, how do you determine the number of clusters?

With the information you provided, this sounds biased to me. My understanding is that silhouette scores are comparable only within the same dataset. Since you are iteratively removing points from the graph, a silhouette score of 0.5 in the 5th iteration does not mean the same as 0.5 in the 1st iteration. In other words, you are relaxing the silhouette criterion in later iterations: with fewer data points left in the graph, new clusters are less likely to have any neighboring data points nearby, which results in a better score.
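To make this concrete, here is a minimal toy sketch in plain Python (not Seurat, and not the poster's actual pipeline). The helper `cluster_silhouettes`, the `square` generator and the 2-D points are all made up for illustration. It scores three clusters with Euclidean distance, then removes cluster "B" and rescores the rest with the labels unchanged.

```python
from math import dist
from statistics import mean


def cluster_silhouettes(points, labels):
    """Mean silhouette width per cluster, using Euclidean distance.

    Assumes at least two clusters and no duplicated points.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    result = {}
    for label, members in clusters.items():
        widths = []
        for i, p in enumerate(members):
            if len(members) == 1:
                widths.append(0.0)  # usual convention for singleton clusters
                continue
            # a: mean distance to the other members of the same cluster
            a = mean(dist(p, q) for j, q in enumerate(members) if j != i)
            # b: mean distance to the nearest other cluster
            b = min(
                mean(dist(p, q) for q in other)
                for other_label, other in clusters.items()
                if other_label != label
            )
            widths.append((b - a) / max(a, b))
        result[label] = mean(widths)
    return result


def square(x0):
    """Four toy points on a unit square whose left edge is at x0."""
    return [(x0, 0.0), (x0, 1.0), (x0 + 1.0, 0.0), (x0 + 1.0, 1.0)]


# A and B sit next to each other, C is far away.
points = square(0.0) + square(2.0) + square(10.0)
labels = ["A"] * 4 + ["B"] * 4 + ["C"] * 4

before = cluster_silhouettes(points, labels)

# Remove cluster B, as the iterative scheme would after accepting it.
kept = [(p, lab) for p, lab in zip(points, labels) if lab != "B"]
after = cluster_silhouettes([p for p, _ in kept], [lab for _, lab in kept])

print("A with B present:", before["A"])
print("A after removing B:", after["A"])
```

In this toy layout the points of cluster A never move, yet its score is below 0.5 while B is present and above 0.5 once B is gone, so the same 0.5 threshold means something different in each round.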

I suggest that you experiment / iterate with cluster numbers and try to find a solution that outright gives you a relatively high mean silhouette score. Even if you expect the number of clusters to be 15, the best solution might be with 10 or 25 clusters. Afterwards you can hopefully make sense of those clusters in the light of biological functions.
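A minimal sketch of such a sweep, in plain Python on made-up 2-D points. The small deterministic k-means here is only a stand-in for whatever clustering method is actually used (it is not what Seurat does), and `mean_silhouette`, `kmeans` and `square` are hypothetical helper names.

```python
from math import dist
from statistics import mean


def mean_silhouette(points, labels):
    """Mean silhouette width over all points (Euclidean distance).

    Assumes at least two non-empty clusters and no duplicated points.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    widths = []
    for label, members in clusters.items():
        for i, p in enumerate(members):
            if len(members) == 1:
                widths.append(0.0)  # usual convention for singletons
                continue
            a = mean(dist(p, q) for j, q in enumerate(members) if j != i)
            b = min(
                mean(dist(p, q) for q in other)
                for other_label, other in clusters.items()
                if other_label != label
            )
            widths.append((b - a) / max(a, b))
    return mean(widths)


def kmeans(points, k, iters=50):
    """Tiny k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster empties
                centers[j] = tuple(mean(coord) for coord in zip(*members))
    return labels


def square(x0, y0):
    """Four toy points on a unit square with its corner at (x0, y0)."""
    return [(x0, y0), (x0, y0 + 1.0), (x0 + 1.0, y0), (x0 + 1.0, y0 + 1.0)]


# Three well-separated toy groups.
points = square(0.0, 0.0) + square(10.0, 0.0) + square(5.0, 10.0)

# Score every candidate cluster number on the SAME full dataset.
scores = {k: mean_silhouette(points, kmeans(points, k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(k, round(score, 3))
print("best k:", best_k)
```

Every candidate is scored on the same full set of points, so the mean silhouette values are directly comparable. With real data the same loop would run over resolutions or cluster numbers of the actual clustering method.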