I am a complete newbie to machine learning and this is my first try, but I am not sure I am approaching the problem in the right way, so I would very much appreciate feedback.
My data and what I am trying to answer: I have a data frame with IDs measured across three different Features (the Features I would like to test), a classification based on the IDs, and Values that depend on the particular ID-Feature combination.
My goal is to test whether one Feature, or a combination of Features, can predict the Class_Name (which is defined from the ID). For now the data is in long format, as below (in this example the number of IDs per Feature is small, but in my real data it varies per Feature and is on the order of 100):
df <- data.frame(
  Name = c("ID1", "ID2", "ID3", "ID4", "ID5", "ID6", "ID9",
           "ID5", "ID7", "ID8", "ID9", "ID10", "ID4", "ID11", "ID12",
           "ID8", "ID9", "ID11", "ID13", "ID8", "ID12", "ID4"),
  Class_Name = c("orange", "orange", "red", "orange", "blue", "blue", "red",
                 "blue", "orange", "blue", "red", "red", "orange", "orange", "red",
                 "blue", "red", "orange", "red", "blue", "red", "orange"),
  Features = c("F1", "F1", "F1", "F1", "F1", "F1", "F1",
               "F2", "F2", "F2", "F2", "F2", "F2", "F2", "F2",
               "F3", "F3", "F3", "F3", "F3", "F3", "F3"),
  Values = c(21, 32, -36, 11, -62, 32, 34,
             -21, -43, -68, -24, -19, 28, 33, -33,
             15, 2, 13, -99, -86, 3, 0)
)
df$Class_Name <- as.factor(df$Class_Name)
I think I need to test each Feature in turn (script below). However, how do I then test whether combinations of features help with the prediction?
I am following tutorials on caret, particularly this one. (If anyone knows of R tutorials with examples like this, please post them.) But I am afraid I am not approaching my question in the right way, and I was wondering if you could point me down the correct path.
Script I have tried:
library(caret)
# library(dplyr)
# table(df$Class_Name)
# select 2 per Feature-class combination so we have a balanced set for each class?
# df <- df %>% group_by(Features, Class_Name) %>% slice_sample(n = 2) %>% data.frame()  # n would be ~100 in my data
control <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation
metric <- "Accuracy"
features <- unique(df$Features)
res <- data.frame()  # meant to collect the per-feature results
for (f in features) {
  dataset <- df[df$Features == f, ]
  # Random Forest with Values as the single predictor
  fit.rf <- train(Class_Name ~ Values, data = dataset, method = "rf",
                  metric = metric, trControl = control)
  print(fit.rf)
}
Hi Jeremy, thank you very much for your reply and suggestions! I will check out the book, and thanks for pointing me to the interaction question and to the varImp() function in caret!
As for the modeling, my question was more about the logic of the set-up. I think I am doing something wrong with the original set-up, not least because I cannot do what you are suggesting: I get an error that the variables have different lengths...
Hi James, you're welcome! That error can be caused by NAs in your data. See a similar question on Stack Overflow: Variable Lengths Differ.
You can impute the missing data using knnImputation() or something similar, or remove rows with NAs using complete.cases(), as in the Stack Overflow answer.
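For example (a minimal sketch, not from the original answer; dataset stands for your per-feature data frame, and knnImputation() comes from the DMwR package):
library(DMwR)  # provides knnImputation()

# Option 1: drop any rows that contain NAs
dataset <- dataset[complete.cases(dataset), ]

# Option 2: fill in NAs from the k nearest neighbours instead of dropping rows
dataset <- knnImputation(dataset, k = 10)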
I don't have the same IDs measured across all three Features... so I don't think my set-up can test interactions like this? If I want to test combined Features, I guess I need to combine them somehow, maybe like the reshape below, before inputting them into a new model...
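Something like this, perhaps? (A sketch using tidyr::pivot_wider; df_wide is just an illustrative name, and repeated ID-Feature measurements, e.g. ID8/F3 in the example data, are averaged.)
library(tidyr)

# One row per ID, one column per Feature
df_wide <- pivot_wider(df,
                       id_cols     = c(Name, Class_Name),
                       names_from  = Features,
                       values_from = Values,
                       values_fn   = mean)  # average duplicated ID-Feature pairs

# Keep only IDs measured on all three Features (only ID4 and ID9 in the toy
# data, but with ~100 IDs per Feature more should overlap)
df_wide <- df_wide[complete.cases(df_wide), ]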
First of all, I wouldn't use a loop to make your models because you're only saving the last model that you make. I would suggest making your models separately, which will make it easier to evaluate and compare them. I'm not sure I totally understand your data and what you're trying to do, but would it make sense to do something like this?
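Something along these lines (a sketch, assuming the data has first been reshaped to wide format with one row per ID and columns F1, F2, F3, and NAs have been handled; the object and model names are illustrative):
library(caret)

control <- trainControl(method = "cv", number = 10)
metric <- "Accuracy"

# Re-seed before each call so every model sees the same CV folds
set.seed(7)
fit.F1 <- train(Class_Name ~ F1, data = df_wide, method = "rf",
                metric = metric, trControl = control)
set.seed(7)
fit.F1.F2 <- train(Class_Name ~ F1 + F2, data = df_wide, method = "rf",
                   metric = metric, trControl = control)
set.seed(7)
fit.all <- train(Class_Name ~ F1 + F2 + F3, data = df_wide, method = "rf",
                 metric = metric, trControl = control)

# Compare cross-validated accuracy across the candidate feature sets
results <- resamples(list(F1 = fit.F1, F1_F2 = fit.F1.F2, all = fit.all))
summary(results)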