I am using the randomForest package to classify 'norm' versus 'chol' samples with the code below, and I get a nice output showing the importance of a panel of genes that contribute to the classification of diseased tissue. However, I have been reading up on this and am wondering whether I need separate training and test data sets. I have 11 normal and 18 diseased samples. I am very happy with the intuitive outputs this is giving, but I want to make sure it's right.
library(randomForest)
clus2 <- read.csv("PCA_NvC_SVM_sig.csv", sep = ",", header = TRUE, row.names = 1)
attach(clus2)
set.seed(71)
clus2.rf <- randomForest(Pathology ~ ., data=clus2, importance=TRUE, proximity=TRUE)
print(clus2.rf)
Result:

Call:
 randomForest(formula = Pathology ~ ., data = clus2, importance = TRUE, proximity = TRUE)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 4

        OOB estimate of error rate: 10.34%
Confusion matrix:
     Chol Norm class.error
Chol   17    1  0.05555556
Norm    2    9  0.18181818
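To check that I am reading this output correctly, here is a small sketch that recomputes the error figures from the confusion matrix alone. The counts are copied by hand from the printout above; the names conf, oob_error and class_error are just my own labels, not anything returned by randomForest. It assumes rows are the true classes and columns are the out-of-bag predictions.

```r
# Confusion matrix copied from the printed output
# (assumed layout: rows = true class, columns = OOB-predicted class)
conf <- matrix(c(17, 1,
                  2, 9),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual    = c("Chol", "Norm"),
                               predicted = c("Chol", "Norm")))

# Overall OOB error: misclassified samples divided by all 29 samples
oob_error <- 1 - sum(diag(conf)) / sum(conf)

# Per-class error: misclassified samples in each row divided by the row total
class_error <- 1 - diag(conf) / rowSums(conf)

print(round(100 * oob_error, 2))  # 10.34, matching the reported OOB estimate
print(class_error)
```

So the 10.34% is 3 misclassified samples out of 29, and the class.error column is 1 of 18 for Chol and 2 of 11 for Norm.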
Look at variable importance:
Imp <- round(importance(clus2.rf), 2)
write.table(Imp, "Importance.csv", sep = ",")
varImpPlot(clus2.rf)