Accounting for Within-Subject Correlation in Microbiome Multiclass Classification
3 months ago
ssko ▴ 20

I have a microbiome dataset with 960 samples from 269 subjects: 193 subjects have a single sample, and 76 subjects have repeated measurements at different time points. The dataset has 70 taxonomic features. The target variable has 4 classes, and I want to build a multiclass classification model to predict it.

What steps should I follow to build a model with such a microbiome dataset and what should I consider? More specifically, to what extent should I account for within-subject correlation due to repeated measurements in the model?

When I evaluated each observation independently, the random forest reached an accuracy of 0.86 on the training data and 0.90 on unseen data. Are these results biased?

microbiome classification longitudinal panel repeated-measures • 303 views
3 months ago
Mensur Dlakic ★ 29k

What steps should I follow to build a model with such a microbiome dataset and what should I consider?

Cross-validation is a must for all classification tasks.

More specifically, to what extent should I account for within-subject correlation due to repeated measurements in the model?

All measurements from the same subject must be kept together. During cross-validation they must all land in the same fold, whether that fold is used for training or for validation. The same applies to a held-out test set: a subject's samples should never be split between the training data and the unseen data.
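As a rough illustration of grouping by subject, here is a minimal sketch using scikit-learn's `GroupKFold`. The data are random placeholders, not the dataset from the question; the subject count, the per-subject sample counts, and the names `groups` and `fold_accuracies` are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the poster's data): 60 subjects, 1-5 samples each.
n_subjects = 60
samples_per_subject = rng.integers(1, 6, size=n_subjects)
groups = np.repeat(np.arange(n_subjects), samples_per_subject)  # subject ID per sample
X = rng.random((groups.size, 70))         # 70 placeholder "taxonomic" features
y = rng.integers(0, 4, size=groups.size)  # 4 placeholder classes

cv = GroupKFold(n_splits=5)
fold_accuracies = []
for train_idx, val_idx in cv.split(X, y, groups=groups):
    # No subject may appear on both sides of the split.
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    fold_accuracies.append(model.score(X[val_idx], y[val_idx]))

print("per-fold validation accuracy:", np.round(fold_accuracies, 3))
```

With real data, `groups` would be the subject identifier of each sample. If the 4 classes are imbalanced, `StratifiedGroupKFold` (available in recent scikit-learn versions) may be worth trying, since it keeps subjects together while also trying to balance class proportions across folds.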

Are these results biased?

Difficult to tell from the information you provided. If by "evaluated each observation independently" you mean that some measurements from the same subject were used for training and others for validation, then those results would be biased, most likely toward overly optimistic accuracy.
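One way to check whether an existing split has this problem is to look for subject IDs that occur on both sides. The sketch below uses made-up subject IDs and a hypothetical helper, `shared_subjects`; it contrasts a naive row-wise split with a subject-wise hold-out built with scikit-learn's `GroupShuffleSplit`.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def shared_subjects(train_ids, test_ids):
    """Return the sorted subject IDs present in both the training and test sets."""
    train_set = set(np.asarray(train_ids).tolist())
    test_set = set(np.asarray(test_ids).tolist())
    return sorted(train_set & test_set)

# Made-up subject IDs, one per sample (some subjects are sampled repeatedly).
subject_ids = np.array(["s1", "s1", "s2", "s3", "s3", "s3", "s4", "s5", "s5", "s6"])

# Naive split by row: every other sample goes to the test set.
naive_train, naive_test = subject_ids[::2], subject_ids[1::2]
naive_overlap = shared_subjects(naive_train, naive_test)
print("subjects leaked by the naive split:", naive_overlap)

# Subject-wise hold-out: all samples of a subject go to the same side.
X_placeholder = np.zeros((subject_ids.size, 1))  # features are irrelevant for splitting
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(X_placeholder, groups=subject_ids))
grouped_overlap = shared_subjects(subject_ids[train_idx], subject_ids[test_idx])
print("subjects leaked by the grouped split:", grouped_overlap)
```

A non-empty overlap for the split that produced the reported accuracies would suggest the estimate is affected by within-subject leakage.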
