Question

Choosing the number of components in a DMM model

3.3 years ago

Chris Dean ▴ 420

I am looking to identify community types in a large, longitudinal 16S marker-gene dataset using Dirichlet Multinomial Mixture models (DMMs). A DMM treats each sample as drawn from one of a fixed number of Dirichlet components (or clusters), so a key step is choosing how many components the input dataset supports.

I applied this model to data collected from separate farms, using sequence features aggregated to the genus level as input. I calculated model fit using the Laplace approximation and then plotted the results as a function of the number of Dirichlet components tested (see below).

[Figure: DMM Model Fit]

Unfortunately, the number of components to select for each farm is not clear (with the exception of Farm E). If I were to select the number of components that minimizes the Laplace approximation, I would end up with 20 components in most cases, which is not a desirable outcome. I thought about using the elbow method, a common practice used in conjunction with PCA, but that doesn't seem to apply here since I am looking at model fit, not variance explained.

My objective for this analysis is to identify the important clusters and then describe which sequence features might be driving them, without overfitting my data. I would be eager to receive the community's feedback on how to go about this in a reasonable manner.

--Chris

Microbiome Clustering Statistics • 1.0k views

Updated 3.3 years ago by Jean-Karim Heriche 27k • written 3.3 years ago by Chris Dean ▴ 420

score 2 · Answer 1 · 2022-02-08

The elbow method is a widely applicable heuristic, not just for PCA, so I don't see any problem with using it here. The selected point has to be interpreted relative to what is being plotted. It can be thought of as the point beyond which adding components adds more noise than signal, i.e. where one starts to overfit. Looking at your plots, if I had to pick the same number for all farms, 7 would seem like a good choice. Clustering is an ill-posed problem in that clusters are in the eye of the beholder. If you have some prior information on the existence of clusters, then you could use that instead.
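If picking the elbow by eye feels too subjective, one way to make it reproducible is to take the point on the fit curve that lies farthest from the straight line joining its first and last points. The snippet below is only a rough sketch of that idea in plain Python; the function name `elbow_k` and the fit values are made up for illustration (they are not the Laplace values from the plots above), and other elbow definitions exist that can give a different answer.

```python
from math import hypot


def elbow_k(ks, fits):
    """Return the k whose (k, fit) point lies farthest from the chord
    joining the first and last points of the curve.

    ks   -- candidate numbers of components, sorted ascending
    fits -- model-fit value for each k (lower = better, e.g. Laplace)
    """
    if len(ks) != len(fits) or len(ks) < 3:
        raise ValueError("need at least three (k, fit) pairs of equal length")

    k_span = ks[-1] - ks[0]
    f_lo, f_hi = min(fits), max(fits)
    f_span = f_hi - f_lo
    if k_span == 0 or f_span == 0:
        raise ValueError("curve is flat or has a single k; no elbow to find")

    # Rescale both axes to [0, 1] so the result does not depend on units.
    xs = [(k - ks[0]) / k_span for k in ks]
    ys = [(f - f_lo) / f_span for f in fits]

    # Perpendicular distance of every point to the first-to-last chord.
    dx, dy = xs[-1] - xs[0], ys[-1] - ys[0]
    norm = hypot(dx, dy)
    dists = [abs(dx * (y - ys[0]) - dy * (x - xs[0])) / norm
             for x, y in zip(xs, ys)]

    return ks[dists.index(max(dists))]


# Hypothetical fit curve (NOT real data): steep improvement up to k = 5,
# then only marginal gains out to k = 20.
ks = list(range(1, 21))
fits = [1000 - 150 * k if k <= 5 else 250 - 4 * (k - 5) for k in ks]

print(elbow_k(ks, fits))  # -> 5
```

On this toy curve the global minimum is at k = 20, yet the elbow lands at k = 5, which is the same situation as in the question: the minimum keeps drifting to the largest k tested while the real gains stop much earlier. Running this per farm would give one candidate per curve, which should still be checked against the plot and any prior knowledge about the samples.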