I am looking to identify community types in a large, longitudinal 16S marker-gene dataset using Dirichlet Multinomial Mixture models (DMMs). The primary goal of this model is to identify the number of components (or clusters) in the input dataset, where each component is described by its own Dirichlet prior, i.e., to estimate how many community types there are.
I applied this model to data collected from separate farms, using sequence features aggregated to the genus level as input. I calculated model fit using the Laplace approximation and then plotted the results as a function of the number of Dirichlet components tested (see below).
Unfortunately, the number of components to select for each farm is not clear (with the exception of Farm E). If I were to select the number of components that minimizes the Laplace approximation, I would end up with 20 components in most cases, which is not a desirable outcome. I thought about using the elbow method, a common practice with PCA scree plots, but that doesn't seem to apply here since I am looking at model fit, not variance explained.
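To make the selection problem concrete, here is a minimal Python sketch of two possible rules: taking the number of components with the lowest Laplace value, and taking the smallest number of components whose Laplace value falls within a relative tolerance of that minimum. The second rule is only one possible heuristic, not something prescribed by the DMM framework. The function names, the 1% tolerance, and the numbers are all made up for illustration and are not my actual results.

```python
from typing import Sequence


def argmin_k(ks: Sequence[int], laplace: Sequence[float]) -> int:
    """Return the k with the lowest Laplace value (lower means better fit)."""
    if len(ks) != len(laplace) or not ks:
        raise ValueError("ks and laplace must be non-empty and of equal length")
    best_index = min(range(len(ks)), key=lambda i: laplace[i])
    return ks[best_index]


def smallest_k_within_tolerance(
    ks: Sequence[int], laplace: Sequence[float], rel_tol: float = 0.01
) -> int:
    """Return the smallest k whose Laplace value is within rel_tol of the minimum.

    rel_tol is a hypothetical tuning parameter; 0.01 is an arbitrary choice.
    """
    if len(ks) != len(laplace) or not ks:
        raise ValueError("ks and laplace must be non-empty and of equal length")
    best = min(laplace)
    threshold = best + rel_tol * abs(best)
    candidates = [k for k, value in zip(ks, laplace) if value <= threshold]
    return min(candidates)


# Placeholder values shaped like a slowly flattening fit curve (not real data).
ks = [1, 2, 3, 4, 5, 6]
laplace = [52000.0, 50800.0, 50300.0, 50100.0, 50040.0, 50000.0]

print(argmin_k(ks, laplace))                     # 6
print(smallest_k_within_tolerance(ks, laplace))  # 3
```

With a curve that keeps creeping downward, the plain minimum lands on the largest k tested, while the tolerance rule stops once the improvement becomes small. The result obviously depends on the tolerance, which is the part I do not know how to justify.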
My objective for this analysis is to identify the important clusters and then describe which sequence features might be driving them, without overfitting my data. I would be eager to receive the community's feedback on how to go about this in a reasonable manner.
--Chris