I am a complete newbie to machine learning and this is my first try, but I am not sure I am approaching the problem in the right way, so I would very much appreciate feedback.
My data and what I am trying to answer: I have a data frame with IDs measured across three different Features (the Features I would like to test), a classification based on the IDs, and Values that depend on the particular ID-Feature combination.
My goal is to test whether one Feature, or a combination of Features, can predict the Class_Name (which is defined from the ID). For now the data is in long format, as below (in this example the number of IDs per Feature is small, but in my real data it varies per Feature and is on the order of 100):
df <- data.frame(
  Name = c("ID1", "ID2", "ID3", "ID4", "ID5", "ID6", "ID9",
           "ID5", "ID7", "ID8", "ID9", "ID10", "ID4", "ID11", "ID12",
           "ID8", "ID9", "ID11", "ID13", "ID8", "ID12", "ID4"),
  Class_Name = c("orange", "orange", "red", "orange", "blue", "blue", "red",
                 "blue", "orange", "blue", "red", "red", "orange", "orange", "red",
                 "blue", "red", "orange", "red", "blue", "red", "orange"),
  Features = c("F1", "F1", "F1", "F1", "F1", "F1", "F1",
               "F2", "F2", "F2", "F2", "F2", "F2", "F2", "F2",
               "F3", "F3", "F3", "F3", "F3", "F3", "F3"),
  Values = c(21, 32, -36, 11, -62, 32, 34,
             -21, -43, -68, -24, -19, 28, 33, -33,
             15, 2, 13, -99, -86, 3, 0)
)
df$Class_Name <- as.factor(df$Class_Name)
I think I need to test each Feature in turn (script below). However, how do I then test whether combinations of features help with the prediction?
I am following tutorials on caret, particularly this one. (If anyone knows of R tutorials with examples like this, please post them.) But I am afraid I am not approaching my question in the right way, and I was wondering if you could point me down the correct path.
Script I have tried:
library(caret)
# library(dplyr)
# table(df$Class_Name)
# select 2 per Feature-class combination so we have a balanced set for each class?
# df <- df %>% group_by(Features, Class_Name) %>% slice_sample(n = 2) %>% data.frame()  # n would be ~100 in my data
control <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation
metric <- "Accuracy"
features <- unique(df$Features)
res <- data.frame()  # meant to collect the per-feature results
for (f in features) {
  dataset <- df[df$Features == f, ]
  # Random Forest with Values as the single predictor
  fit.rf <- train(Class_Name ~ Values, data = dataset, method = "rf",
                  metric = metric, trControl = control)
  print(fit.rf)
}
Hi Jeremy, thank you very much for your reply and suggestions! I will check out the book, and thanks for pointing me to the interaction question and to the varImp() function in caret!
As for the modeling, my question was more about the logic of the set-up. I think I am doing something wrong with the original set-up, not least because I cannot do what you are suggesting: I get an error that the variables have different lengths...
Hi James, you're welcome! That error can be caused by NAs in your data. See a similar question on Stack Overflow: Variable Lengths Differ.
You can impute the missing data using knnImputation() or something similar, or remove rows with NAs using complete.cases(), as in the Stack Overflow answer.
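For example (a minimal sketch, not from the original answer; dataset stands for your per-feature data frame, and knnImputation() comes from the DMwR package):
library(DMwR)  # provides knnImputation()

# Option 1: drop any rows that contain NAs
dataset <- dataset[complete.cases(dataset), ]

# Option 2: fill in NAs from the k nearest neighbours instead of dropping rows
dataset <- knnImputation(dataset, k = 10)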
I don't have the same IDs measured across all three Features... so I don't think my set-up can test interactions like this? If I want to test combined Features, I guess I need to combine them somehow, maybe like the reshape below, before inputting them into a new model...
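Something like this, perhaps? (A sketch using tidyr::pivot_wider; df_wide is just an illustrative name, and repeated ID-Feature measurements, e.g. ID8/F3 in the example data, are averaged.)
library(tidyr)

# One row per ID, one column per Feature
df_wide <- pivot_wider(df,
                       id_cols     = c(Name, Class_Name),
                       names_from  = Features,
                       values_from = Values,
                       values_fn   = mean)  # average duplicated ID-Feature pairs

# Keep only IDs measured on all three Features (only ID4 and ID9 in the toy
# data, but with ~100 IDs per Feature more should overlap)
df_wide <- df_wide[complete.cases(df_wide), ]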
First of all, I wouldn't use a loop to make your models because you're only saving the last model that you make. I would suggest making your models separately, which will make it easier to evaluate and compare them. I'm not sure I totally understand your data and what you're trying to do, but would it make sense to do something like this?
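Something along these lines (a sketch, assuming the data has first been reshaped to wide format with one row per ID and columns F1, F2, F3, and NAs have been handled; the object and model names are illustrative):
library(caret)

control <- trainControl(method = "cv", number = 10)
metric <- "Accuracy"

# Re-seed before each call so every model sees the same CV folds
set.seed(7)
fit.F1 <- train(Class_Name ~ F1, data = df_wide, method = "rf",
                metric = metric, trControl = control)
set.seed(7)
fit.F1.F2 <- train(Class_Name ~ F1 + F2, data = df_wide, method = "rf",
                   metric = metric, trControl = control)
set.seed(7)
fit.all <- train(Class_Name ~ F1 + F2 + F3, data = df_wide, method = "rf",
                 metric = metric, trControl = control)

# Compare cross-validated accuracy across the candidate feature sets
results <- resamples(list(F1 = fit.F1, F1_F2 = fit.F1.F2, all = fit.all))
summary(results)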