I am trying to build a decision tree model predicting an outcome variable (named Results) from a set of predictor variables. I applied one-hot encoding to some of the variables with more than two levels to expand the number of predictors a bit (a sketch of the encoding step is below). I first explored the data, then split it 80/20 and ran the model, but the model fit on the training set ends with a single node and no branches, as seen in the figure below.
[Figure: fitted tree consisting of a single root node with no splits]
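For reference, the one-hot encoding step looked roughly like the sketch below. It uses dummyVars() from caret, and var_a / var_b are placeholder names, not my real columns:

library(caret)
# Expand each multi-level factor into one 0/1 column per level
# (var_a and var_b stand in for my actual ">2 level" variables)
dv <- dummyVars(~ var_a + var_b, data = data_hum_mod)
oh <- predict(dv, newdata = data_hum_mod)
# Drop the original factors and bind the dummy columns in their place
data_hum_mod <- cbind(data_hum_mod[setdiff(names(data_hum_mod), c("var_a", "var_b"))],
                      as.data.frame(oh))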
Looking at similar posts, I found that my data is imbalanced: checking prop.table() of the class assignment of the Results variable, the majority of cases are Negative rather than Positive. Any suggestions for growing a correct tree on this data?
Here is my code:
# Split the data into train and test sets (80% train, 20% test)
set.seed(1234)
pd <- sample(2, nrow(data_hum_mod), replace = TRUE, prob = c(0.8, 0.2))
data_hum_train <- data_hum_mod[pd == 1, ]
data_hum_test  <- data_hum_mod[pd == 2, ]
# Data exploration after splitting
# Check the data dimensions
dim(data_hum_train); dim(data_hum_test)
# Make sure the split data sets have a balanced n of each outcome class
# (i.e. Positive/Negative toxo)
prop.table(table(data_hum_train$Results)) * 100
prop.table(table(data_hum_test$Results)) * 100
This gave the following results:
# (Train)
 Negative  Positive 
 75.75758  24.24242 
# (Test)
 Negative  Positive 
 54.54545  45.45455 
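Since the class proportions differ so much between my train and test sets (about 76/24 vs. 55/45), I wonder whether a stratified split would fix at least that part. A minimal sketch with createDataPartition() from caret, which samples within each level of Results:

library(caret)
set.seed(1234)
# Stratified 80/20 split: sampling is done within each Results class,
# so train and test keep roughly the same class proportions
idx <- createDataPartition(data_hum_mod$Results, p = 0.8, list = FALSE)
data_hum_train <- data_hum_mod[idx, ]
data_hum_test  <- data_hum_mod[-idx, ]
prop.table(table(data_hum_train$Results)) * 100
prop.table(table(data_hum_test$Results)) * 100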
# Check missing values
anyNA(data_hum_mod)
# Make sure none of the variables have zero or near-zero variance
library(caret)  # nzv() comes from the caret package
nzv(data_hum_mod)
# Build the model (using the party package)
install.packages('party')
library(party)
# mincriterion = 0.1 lowers the significance threshold (1 - alpha),
# so splits are accepted more easily
data_human_train_tree <- ctree(Results ~ ., data = data_hum_train,
                               controls = ctree_control(mincriterion = 0.1))
data_human_train_tree
plot(data_human_train_tree)
With this code I obtained the same single-node tree as in the figure above.
I got the same result using other packages such as C50 and rpart.
Could you advise on this? I have also read about subsampling the majority class (here, the Negative level of Results); how can one implement this in R?
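From what I read, downsampling could look like the sketch below (plain base-R indexing; caret also has a downSample() helper), but I am not sure this is the right way to do it:

set.seed(1234)
# Keep all Positive rows and draw an equally sized random subset
# of the majority Negative rows from the training set
pos <- data_hum_train[data_hum_train$Results == "Positive", ]
neg <- data_hum_train[data_hum_train$Results == "Negative", ]
neg_sub <- neg[sample(nrow(neg), nrow(pos)), ]
train_bal <- rbind(pos, neg_sub)
table(train_bal$Results)  # both classes should now have equal counts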
Yes, I keep the Results column in both the train and the test data when I split the whole matrix, because I am going to train the model on it. Is it a must to remove it when one trains the model? If so, what would be the dependent outcome variable in the formula?
Can you please comment on this: I have not seen any model where the outcome variable had to be removed before running the model. Can you explain why you think we need to do it here?
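For context, my understanding of the formula interface is that Results stays in both data frames: it is the left-hand side of the formula during training, and predict() only looks at the predictor columns of the test set:

# Results is the outcome (left-hand side); all other columns are predictors
model <- ctree(Results ~ ., data = data_hum_train)
# predict() uses only the predictor columns of the test set
pred <- predict(model, newdata = data_hum_test)
table(Predicted = pred, Actual = data_hum_test$Results)  # confusion table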