I have created a model predicting an outcome with two classes ("active" and "no history") including 23 features-genes using VST transformed RNA-seq data (134 samples in the training dataset). I have used caret package in R and using glm model. I am trying to interpret my model using pdp package.
I have attached a pdp plot for gene A (5th most important gene in my model), pdp plot with ICE curves and a boxplot showing the expression of this gene for "active" and "no history" status. Looking at the pdp plot I understand that the higher the expression of gene A the higher the probability for predicting "active" status. However, I see the opposite in the boxplot. I observe this phenomenon for 3 more features in my model however the range for the probability is 0.28-0.3 which is almost a flat line if I plot the pdp from 0-1 probability. These other 3 genes are lower in the list of feature importance.
I have checked that features do not correlate with each other.
I am wondering why this happens I assume that the other features in the model interact some how with gene A. Is this a valid assumption?
Creating model
library(caret) ctrl.continuous <- trainControl(method = "repeatedcv", number = 10, repeats = 5, savePredictions = "final") model<-train(train_VST_clin[,-24],train_VST_clin[,outcomeName], method='glm', trControl=ctrl.continuous)
PDP plot
library(pdp) pdp_plot <- partial(model, pred.var = "gene A", prob=TRUE, plot = TRUE, rug=TRUE, which.class="active")