Hi everyone,
I'm fairly new here and currently working in a biology lab for my master's internship.
I'm in charge of the single-cell analysis of foetal tissues and I've run into some questions.
After running a clustering algorithm on cell expression profiles, how can one assess the goodness of fit? I've seen people here and there trying to tackle this problem with p-values, but it doesn't seem valid to me to compute p-values for expression differences between clusters that were themselves defined from those same expression profiles (data dredging?).
What do you use in that case? Cluster silhouette width? The Hopkins statistic? Cohen's d? I'm kind of lost here.
Thanks a lot for reading!
Clustering is in the eye of the beholder. A given clustering algorithm typically makes assumptions and optimizes a specific objective function. The choice of assumptions and objective function may or may not lead to a clustering that you would consider good (which first requires you to define "good"). What matters is how relevant the clusters are to your research question. You may have a perfect fit to the data from a mathematical point of view, but that doesn't guarantee relevance; conversely, a poor fit could simply indicate that you've chosen an unsuitable clustering algorithm.
The measure of fit to use depends in part on the clustering algorithm you use. There is no consensus on the matter, but a useful measure should reflect your research question. Typically this means looking at how some relevant property is distributed across clusters, e.g. whether clusters are enriched in a specific label. In the absence of prior knowledge, there's no way to assess whether the clusters are relevant without collecting additional information about each cluster.
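As a rough illustration of the "enriched in a specific label" idea, here is a minimal sketch of a one-sided hypergeometric enrichment test in plain Python. The function name and arguments are made up for this example, and it assumes the label comes from information that was not used to build the clusters (otherwise you are back to the circularity raised in the question).

```python
from math import comb

def enrichment_pvalue(n_total, n_labeled, n_cluster, n_overlap):
    """One-sided hypergeometric p-value.

    Probability of seeing at least `n_overlap` labeled cells in a cluster
    of size `n_cluster`, if the cluster were a random draw (without
    replacement) from `n_total` cells of which `n_labeled` carry the label.
    """
    max_overlap = min(n_labeled, n_cluster)
    denom = comb(n_total, n_cluster)
    # Sum the upper tail of the hypergeometric distribution.
    tail = sum(
        comb(n_labeled, k) * comb(n_total - n_labeled, n_cluster - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / denom

# Hypothetical numbers: 10 cells, 5 carry the label,
# and a cluster of 5 cells contains all 5 labeled ones.
p = enrichment_pvalue(n_total=10, n_labeled=5, n_cluster=5, n_overlap=5)
print(p)
```

If you test many clusters and many labels this way, the p-values would need a multiple-testing correction.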
Maybe in your case you want to know if clustering can group cells from the same tissue of origin. If you know the tissue of origin for some of the cells, you could check how often cells from the same origin end up in the same cluster.
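A minimal sketch of that check, assuming you have one tissue-of-origin label per cell (with None where it is unknown) and one cluster ID per cell. The function names and the toy data are purely illustrative, not taken from any particular single-cell package.

```python
from collections import Counter
from itertools import combinations

def same_origin_coclustering(origins, clusters):
    """Fraction of same-origin cell pairs that fall in the same cluster.

    Cells with an unknown origin (None) are ignored.
    """
    same_origin = 0
    together = 0
    for (o1, c1), (o2, c2) in combinations(zip(origins, clusters), 2):
        if o1 is None or o2 is None:
            continue
        if o1 == o2:
            same_origin += 1
            if c1 == c2:
                together += 1
    return together / same_origin if same_origin else float("nan")

def cluster_composition(origins, clusters):
    """Count cells of each known origin within each cluster."""
    table = {}
    for origin, cluster in zip(origins, clusters):
        if origin is not None:
            table.setdefault(cluster, Counter())[origin] += 1
    return table

# Toy data: six cells, one with unknown origin.
origins = ["liver", "liver", "liver", "lung", "lung", None]
clusters = [0, 0, 1, 1, 1, 0]

print(same_origin_coclustering(origins, clusters))
print(cluster_composition(origins, clusters))
```

The pairwise loop is quadratic in the number of labeled cells, so for a large dataset you would probably work from the origin-by-cluster count table instead, or use an established index such as the adjusted Rand index.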
Thank you very much for your reply! I will discuss this with my biologist colleagues.