Control-disease pair study design

5.3 years ago

Gene_MMP8 ▴ 240

I have a clinical dataset for a particular disease that includes a mortality variable. I am building machine learning models to predict mortality from the other clinical features (38 in total; sample size 276), most of which are categorical. Three disease stages are listed: control (47 samples), disease (52 samples), and non-disease (176 samples). The values of the clinical features for the control set are all "n/a", meaning no information was collected for the control cases. Is it therefore wise to treat the "control" cases as "non-disease" and also consider them as living (mortality -yes)? Doing so would increase the sample size available for model building. The missing values would be imputed using some technique.
Is this a sound approach? Am I introducing bias into the model by imputing so many categorical features for the control cases?
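
To make the concern concrete, here is a minimal sketch of what could happen if the fully missing control rows were filled in by mode imputation. Everything in it is an assumption for illustration: the toy rows, the feature names `smoker` and `diabetes`, and the helper `mode_impute` are hypothetical, and mode imputation is only one candidate for the unspecified "some technique" mentioned above.

```python
from collections import Counter

# Toy data, entirely made up for illustration: two categorical features.
# None marks a missing value; every control row is fully missing.
observed = [
    {"group": "disease", "smoker": "yes", "diabetes": "yes"},
    {"group": "disease", "smoker": "yes", "diabetes": "no"},
    {"group": "non-disease", "smoker": "no", "diabetes": "no"},
    {"group": "non-disease", "smoker": "yes", "diabetes": "no"},
    {"group": "non-disease", "smoker": "no", "diabetes": "no"},
]
controls = [{"group": "control", "smoker": None, "diabetes": None} for _ in range(3)]
features = ["smoker", "diabetes"]


def mode_impute(rows, reference, features):
    """Fill missing values with the most frequent value among reference rows."""
    modes = {
        f: Counter(r[f] for r in reference if r[f] is not None).most_common(1)[0][0]
        for f in features
    }
    return [
        {**r, **{f: (r[f] if r[f] is not None else modes[f]) for f in features}}
        for r in rows
    ]


imputed_controls = mode_impute(controls, observed, features)

# Every control ends up with the same feature vector, so these rows add
# to the sample count without adding independent feature information.
distinct_profiles = {tuple(r[f] for f in features) for r in imputed_controls}
print(len(imputed_controls), "control rows,", len(distinct_profiles), "distinct profile")
```

In this toy case all imputed control rows collapse to one identical profile that shares a single outcome label, which is the kind of artificial pattern a model could pick up on. Other imputation methods would behave differently, so this is only an illustration of the risk, not a verdict on the design.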

R • 613 views
