Question

Accounting for Within-Subject Correlation in Microbiome Multiclass Classification

3 months ago

ssko ▴ 20

I have a microbiome dataset with 960 samples from 269 subjects. 193 subjects have a single sample and 76 subjects have repeated measurements at different time points. The dataset contains 70 taxonomic features. The target variable has 4 classes, and I want to build a multiclass classification model to predict them.

What steps should I follow to build a model with such a microbiome dataset and what should I consider? More specifically, to what extent should I account for within-subject correlation due to repeated measurements in the model?

When I evaluated each observation independently, the random forest achieved an accuracy of 0.86 on the training data and 0.90 on unseen data. Are these results biased?

microbiome classification longitudinal panel repeated-measures • 303 views

updated 3 months ago by Mensur Dlakic ★ 29k • written 3 months ago by ssko ▴ 20

score 1 · Answer 1 · 2025-01-27

What steps should I follow to build a model with such a microbiome dataset and what should I consider?

Cross-validation is a must for all classification tasks.

More specifically, to what extent should I account for within-subject correlation due to repeated measurements in the model?

All measurements from the same subject must be grouped. That means they always fall in the same fold during cross-validation, whether that fold is used for training or for validation. The same applies to an unseen (held-out) set: all samples from a subject must go either into the training data or into the held-out data, never both.

Are these results biased?

Difficult to tell from the information you provided. If by "evaluated each observation independently" you mean that some measurements from the same subject were used for training and others for validation, then the results are biased, most likely toward overly optimistic accuracy, because the model is tested on subjects it has already seen.
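One quick way to find out whether an existing split has this problem is to compare the subject IDs on both sides. This is a minimal sketch using only the standard library; the function name and the example IDs are hypothetical.

```python
def shared_subjects(train_subject_ids, test_subject_ids):
    """Return the set of subjects that appear in both splits."""
    return set(train_subject_ids) & set(test_subject_ids)


# Hypothetical subject IDs, one entry per sample.
train_ids = ["S01", "S01", "S02", "S03"]
test_ids = ["S03", "S04"]

leaked = shared_subjects(train_ids, test_ids)
print(sorted(leaked))  # ['S03']
```

An empty result means no subject contributes samples to both sets. Anything else means the reported accuracy should be recomputed with a grouped split.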