I am trying to build a decision tree model predicting an outcome variable (named Results) from a set of predictor variables. I applied one-hot encoding to some of the variables with more than two levels to expand the number of predictors a bit (a sketch of the encoding step is below). I first explored the data, then split it 80/20 and ran the model, but the model fit on the training set ends with a single node and no branches, as seen in the figure below.
[Figure: fitted tree consisting of a single root node with no splits]
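For reference, the one-hot encoding step looked roughly like the sketch below. It uses dummyVars() from caret, and var_a / var_b are placeholder names, not my real columns:

library(caret)
# Expand each multi-level factor into one 0/1 column per level
# (var_a and var_b stand in for my actual ">2 level" variables)
dv <- dummyVars(~ var_a + var_b, data = data_hum_mod)
oh <- predict(dv, newdata = data_hum_mod)
# Drop the original factors and bind the dummy columns in their place
data_hum_mod <- cbind(data_hum_mod[setdiff(names(data_hum_mod), c("var_a", "var_b"))],
                      as.data.frame(oh))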
Looking at similar posts, I found that my data is imbalanced: checking prop.table() of the class assignment of the Results variable, the majority of cases are Negative rather than Positive. Any suggestions for growing a correct tree on this data?
Here is my code:
# Split the data into train and test sets (80% train, 20% test)
set.seed(1234)
pd <- sample(2, nrow(data_hum_mod), replace = TRUE, prob = c(0.8, 0.2))
data_hum_train <- data_hum_mod[pd == 1, ]
data_hum_test  <- data_hum_mod[pd == 2, ]
# Data exploration after splitting
# Check the data dimensions
dim(data_hum_train); dim(data_hum_test)
# Make sure the split data sets have a balanced n of each outcome class
# (i.e. Positive/Negative toxo)
prop.table(table(data_hum_train$Results)) * 100
prop.table(table(data_hum_test$Results)) * 100
This gave the following results:
# (Train)
 Negative  Positive 
 75.75758  24.24242 
# (Test)
 Negative  Positive 
 54.54545  45.45455 
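Since the class proportions differ so much between my train and test sets (about 76/24 vs. 55/45), I wonder whether a stratified split would fix at least that part. A minimal sketch with createDataPartition() from caret, which samples within each level of Results:

library(caret)
set.seed(1234)
# Stratified 80/20 split: sampling is done within each Results class,
# so train and test keep roughly the same class proportions
idx <- createDataPartition(data_hum_mod$Results, p = 0.8, list = FALSE)
data_hum_train <- data_hum_mod[idx, ]
data_hum_test  <- data_hum_mod[-idx, ]
prop.table(table(data_hum_train$Results)) * 100
prop.table(table(data_hum_test$Results)) * 100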
# Check missing values
anyNA(data_hum_mod)
# Make sure none of the variables have zero or near-zero variance
library(caret)  # nzv() comes from the caret package
nzv(data_hum_mod)
# Build the model (using the party package)
install.packages('party')
library(party)
# mincriterion = 0.1 lowers the significance threshold (1 - alpha),
# so splits are accepted more easily
data_human_train_tree <- ctree(Results ~ ., data = data_hum_train,
                               controls = ctree_control(mincriterion = 0.1))
data_human_train_tree
plot(data_human_train_tree)
With this code I obtained the same single-node tree as in the figure above.
I got the same result using other packages such as C50 and rpart.
Could you advise on this? I have also read about subsampling the majority class (here, the Negative level of Results); how can one implement this in R?
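From what I read, downsampling could look like the sketch below (plain base-R indexing; caret also has a downSample() helper), but I am not sure this is the right way to do it:

set.seed(1234)
# Keep all Positive rows and draw an equally sized random subset
# of the majority Negative rows from the training set
pos <- data_hum_train[data_hum_train$Results == "Positive", ]
neg <- data_hum_train[data_hum_train$Results == "Negative", ]
neg_sub <- neg[sample(nrow(neg), nrow(pos)), ]
train_bal <- rbind(pos, neg_sub)
table(train_bal$Results)  # both classes should now have equal counts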
Yes, I keep the Results column in both the train and the test data when I split the whole matrix, because I am going to train the model on it. Is it a must to remove it when one trains the model? If so, what would be the dependent outcome variable in the formula?
Can you please comment on this: I have not seen any model where the outcome variable had to be removed before running the model. Can you explain why you think we need to do it here?
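For context, my understanding of the formula interface is that Results stays in both data frames: it is the left-hand side of the formula during training, and predict() only looks at the predictor columns of the test set:

# Results is the outcome (left-hand side); all other columns are predictors
model <- ctree(Results ~ ., data = data_hum_train)
# predict() uses only the predictor columns of the test set
pred <- predict(model, newdata = data_hum_test)
table(Predicted = pred, Actual = data_hum_test$Results)  # confusion table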