Hi.
I'm trying to run a random forest on some microarray data using the following code, but I'm getting the error in the title. As you will see below, I tried to work around the issue by following the Stack Overflow link commented in the code, but without any success.
acc = numeric()
for(i in 1:20){
  # Random Sampling with 70-30% for training and validation respectively
  y = z = 0
  while(y != 9 || z != 9){
    sample = sample(x = 1:nrow(data) , size = 0.7 * nrow(data) )
    train = data[sample,]
    test = data[-sample,]
    y = length(unique(train$classes))
    z = length(unique(test$classes))
  }
  print(paste(y , z))
  # https://stat.ethz.ch/pipermail/r-help/2008-March/156608.html
  # https://stackoverflow.com/questions/17059432/random-forest-package-in-r-shows-error-during-prediction-if-there-are-new-fact
  test$classes <- as.character(test$classes)
  train$classes <- as.character(train$classes)
  test$isTest <- rep(1,nrow(test))
  train$isTest <- rep(0,nrow(train))
  fullSet <- rbind(test,train)
  fullSet$classes <- as.factor(fullSet$classes)
  test.new <- fullSet[fullSet$isTest==1,]
  train.new <- fullSet[fullSet$isTest==0,]
  test.new$isTest = NULL
  train.new$isTest = NULL
  print(levels(test.new$classes))
  print(levels(train.new$classes))
  # Calculating the model with
  # mtry : number of variables randomly sampled as candidates at each split
  # ntree : number of trees to grow
  rf = randomForest(classes~., data=as.matrix(train.new), mtry=5, ntree=2000, importance=TRUE)
  p = predict(rf, test.new)
  acc = mean(test.new$classes == p)
  print(acc)
  # Keep track and save the models that have high accuracy
  if(acc > 0.65){
    print(acc)
    saveRDS(rf , paste("./rf_models/rf_", i, "_", acc, ".rds", sep=""))
  }
}
The error I'm getting is:
Error in predict.randomForest(rf, test.new) :
  New factor levels not present in the training data
Calls: predict -> predict.randomForest
Execution halted
So I added
print(levels(test.new$classes))
print(levels(train.new$classes))
to check whether the levels of the training and testing sets differ.
These lines printed:
[1] "Treat1" "Treat2" "Treat3" "Treat4" "Treat5" "Treat6" "Treat7" "Tg" "Wt"
[1] "Treat1" "Treat2" "Treat3" "Treat4" "Treat5" "Treat6" "Treat7" "Tg" "Wt"
Am I doing something wrong? How should I approach this issue?
I get this error for ForestDNM on new versions of GATK haplotype caller VCFs. If someone figures this out I would love to know the answer!
One thing that I see is that you coerce your training data into a matrix with as.matrix(train.new) in the randomForest() function. Using this may have unexpected consequences: because train.new contains a factor column, as.matrix() coerces every column to character, so the predictors no longer have the same type as in the testing data.

Another thing: when you split a data frame that has categorical variables / factors, it's good practice to relevel those factors in the new objects.
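To make the two points above concrete, here is a minimal sketch in base R. The data frame `toy` and its column names are made up for illustration; only `as.matrix()` and `droplevels()` are the functions under discussion.

```r
# Toy data frame (hypothetical): two numeric predictors plus a factor outcome.
toy <- data.frame(
  gene1   = c(1.2, 3.4, 5.6, 7.8),
  gene2   = c(0.1, 0.2, 0.3, 0.4),
  classes = factor(c("Wt", "Tg", "Wt", "Treat1"))
)

# A matrix holds a single type, so the factor column forces every column,
# including the numeric ones, to character.
m <- as.matrix(toy)
matrix_type <- typeof(m)

# Subsetting a data frame keeps all factor levels, even unused ones.
sub <- toy[toy$classes != "Treat1", ]
levels_before <- levels(sub$classes)

# droplevels() removes the levels that no longer occur in the subset.
sub <- droplevels(sub)
levels_after <- levels(sub$classes)
```

Checking `typeof()` on whatever is passed as `data=` is a quick way to spot this kind of silent coercion.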
I just realized that all predictor columns are factors. Could this be the cause of the problem? Should I convert them to numeric?
Yes, that will also create an issue: they should be numeric. The best way to avoid a situation like that is to go back through each step and determine where the numbers are being converted into factors.
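If the values do need converting back, one common pitfall is calling as.numeric() directly on a factor. A small sketch, using made-up values and column names (`g1`, `g2`, and `df` are hypothetical):

```r
# Hypothetical expression values that were read in as a factor by mistake.
expr <- factor(c("10.5", "2.25", "7", "2.25"))

# as.numeric() on a factor returns the internal level codes, not the values.
codes <- as.numeric(expr)

# Going through character first recovers the original numbers.
values <- as.numeric(as.character(expr))

# The same idea applied to every predictor column of a data frame,
# leaving the outcome column 'classes' as a factor.
df <- data.frame(
  g1      = factor(c("1.5", "2.5")),
  g2      = factor(c("3", "4")),
  classes = factor(c("Wt", "Tg"))
)
pred_cols <- setdiff(names(df), "classes")
df[pred_cols] <- lapply(df[pred_cols], function(col) as.numeric(as.character(col)))
```

Here `codes` holds the level positions (1, 2, 3, 2) while `values` holds the actual numbers (10.5, 2.25, 7, 2.25).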
Another problem, I believe, is with this piece of code:
This will mean that classes (encoded as integers) is going to be included as both a predictor and the outcome. Your data should be the original data without the outcome variable, something like:
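One way to keep the outcome clearly separate from the predictors is to pass them as two objects. The following is only a sketch on simulated data (the data frame, gene names, and ntree value are assumptions, not the poster's code); the randomForest call is guarded so the rest runs even without the package. Note that with a data frame, a formula such as classes ~ . already leaves the response out of the predictors.

```r
set.seed(1)

# Simulated training data (hypothetical): two predictors and a factor outcome.
train <- data.frame(
  gene1   = rnorm(30),
  gene2   = rnorm(30),
  classes = factor(rep(c("Tg", "Wt", "Treat1"), each = 10))
)

# Predictors only: every column except the outcome.
x_train <- train[, setdiff(names(train), "classes"), drop = FALSE]
# Outcome kept as a factor so that a classification forest is grown.
y_train <- train$classes

# Fit only if the randomForest package is installed.
if (requireNamespace("randomForest", quietly = TRUE)) {
  rf <- randomForest::randomForest(x = x_train, y = y_train, ntree = 100)
  p  <- predict(rf, x_train)
}
```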
I do something similar here with lasso (see the step 'Perform 10-fold cross validation'): A: How to exclude some of breast cancer subtypes just by looking at gene expressio
I removed the as.matrix() and also converted the factors to numeric with the following code:
They must have become factors when I read the file.
As for the classes column you mentioned, it is not encoded as integers but as a factor (this is how randomForest wants it), and there is no need to exclude it from the training data itself. As far as I remember, randomForest() can handle this.
Anyway, after removing as.matrix() and converting the gene expression values from factors to numeric, it is now working.
Great that it is now resolved. On the conversion from factors to numerical values, please just double check that it has done this as you expected. This is R 'Programming': it's messy, so unexpected things happen frequently!
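A quick way to do that double check is to confirm that every converted column is numeric and to count any values that turned into NA. A minimal sketch, assuming a made-up data frame `raw` in which one entry ("n/a") is not a number:

```r
# Hypothetical predictors that were read in as factors.
raw <- data.frame(
  gene1 = factor(c("1.5", "2.5", "n/a")),
  gene2 = factor(c("3", "4", "5"))
)

# Convert every column via character; non-numeric strings become NA.
converted <- raw
converted[] <- lapply(raw, function(col) {
  suppressWarnings(as.numeric(as.character(col)))
})

# Check 1: all columns are now numeric.
all_numeric <- all(vapply(converted, is.numeric, logical(1)))

# Check 2: how many NAs did the conversion introduce per column?
new_na <- colSums(is.na(converted)) - colSums(is.na(raw))
```

A non-zero entry in `new_na` points to a column that contained something other than numbers, which is often why it was read as a factor in the first place.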
Yeah. I noticed that. Everything seems to be right.
QVINTVS_FABIVS_MAXIMVS, if your problem is different, then please post a new question.
I figured it out. The VCFs I was working on had different Tranche levels, so there was a factor level that the model had not been trained on.
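For anyone hitting the same error, a check along these lines can show which factor values in the new data were never seen in training. This is a sketch only: `new_levels()` is a hypothetical helper, and the `tranche` column and its values are invented for illustration.

```r
# Hypothetical training and test sets with one factor column.
train <- data.frame(tranche = factor(c("A", "B", "A")))
test  <- data.frame(tranche = factor(c("A", "C")))

# For every factor column in the training data, list the values that
# occur in the test data but are not among the training levels.
new_levels <- function(train_df, test_df) {
  factor_cols <- names(train_df)[vapply(train_df, is.factor, logical(1))]
  out <- lapply(factor_cols, function(col) {
    setdiff(unique(as.character(test_df[[col]])), levels(train_df[[col]]))
  })
  names(out) <- factor_cols
  # Keep only the columns that actually have unseen values.
  out[lengths(out) > 0]
}

unseen <- new_levels(train, test)
```

An empty result means every factor value in the test data was present during training.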