This tutorial is part of a series illustrating basic concepts and techniques for machine learning in R. We will try to build a classifier of relapse in breast cancer. The analysis plan will follow the general pattern (simplified) of a past publication.
This follows from: Machine learning for cancer classification - part 1 - preparing the data sets and Machine learning for cancer classification - part 2 - Building a Random Forest Classifier. This part covers how to use a pre-computed Random Forest classification model to predict relapse in new samples of breast cancer. We make the assumption that the data sets will be in the correct format as produced in part 1. The data sets in this tutorial are the 'test data' from the prior tutorial, retrieved from GSE2990. To avoid the hassle of Copy-Pasting every block of code, the full script can be downloaded here.
Install any missing packages, if necessary
install.packages("randomForest")
install.packages("ROCR")
install.packages("Hmisc")
Load the necessary packages.
library(randomForest)
library(ROCR)
require(Hmisc)
Set working directory and filenames for Input/output. Use the directory where you saved the RF_model
, testset_gcrma.txt
and testset_clindetails.txt
files from Part 2.
setwd("/Users/obigriffith/git/biostar-tutorials/MachineLearning")
RF_model_file="RF_model"
datafile="testset_gcrma.txt"
clindatafile="testset_clindetails.txt"
outfile="testset_RFoutput.txt"
case_pred_outfile="testset_CasePredictions.txt"
ROC_pdffile="testset_ROC.pdf"
vote_dist_pdffile="testset_vote_dist.pdf"
Next we will read in the data sets (expecting a tab-delimited file with header line and rownames). We also need to rearrange the clinical data so that it will be in the same order as the GCRMA expression data.
data_import=read.table(datafile, header = TRUE, na.strings = "NA", sep="\t")
clin_data_import=read.table(clindatafile, header = TRUE, na.strings = "NA", sep="\t")
clin_data_order=order(clin_data_import[,"geo_accn"])
clindata=clin_data_import[clin_data_order,]
data_order=order(colnames(data_import)[4:length(colnames(data_import))])+3 #Order data without first three columns, then add 3 to get correct index in original file
rawdata=data_import[,c(1:3,data_order)] #grab first three columns, and then remaining columns in order determined above
header=colnames(rawdata)
Extract just the expression values from the filtered data and transpose the matrix.
predictor_data=t(rawdata[,4:length(header)])
predictor_names=as.vector((rawdata[,3]))
colnames(predictor_data)=predictor_names
Load the RandomForests classifier from file (object rf_output
which was saved previously)
load(file=RF_model_file)
Extract predictor (gene) names from RF model built in the previous tutorial and subset the test data to just these predictors. This is not strictly necessary as the randomForest predict function would automatically restrict itself to these variables. It has been added for clarity.
RF_predictor_names=rownames(rf_output$importance)
predictor_data=predictor_data[,RF_predictor_names]
Run the test data through forest!
RF_predictions_responses=predict(rf_output, predictor_data, type="response")
RF_predictions_votes=predict(rf_output, predictor_data, type="vote")
Join predictions with clinical data and exclude rows with missing clinical data needed for subsequent steps. Then, write results to file.
clindata_plusRF=cbind(clindata,RF_predictions_responses,RF_predictions_votes)
clindata_plusRF=clindata_plusRF[which(!is.na(clindata_plusRF[,"event.rfs"])),]
write.table(clindata_plusRF,file=case_pred_outfile, sep="\t", quote=FALSE, col.names=TRUE, row.names=FALSE)
As before, the following lines will give an overview of the classifier's performance. This time instead of estimated performance from out-of-bag (OOB) testing we are measuring actual performance on an independent test set.
confusion=table(clindata_plusRF[,c("event.rfs","RF_predictions_responses")])
rownames(confusion)=c("NoRelapse","Relapse")
sensitivity=(confusion[2,2]/(confusion[2,2]+confusion[2,1]))*100
specificity=(confusion[1,1]/(confusion[1,1]+confusion[1,2]))*100
overall_error=((confusion[1,2]+confusion[2,1])/sum(confusion))*100
overall_accuracy=((confusion[1,1]+confusion[2,2])/sum(confusion))*100
class1_error=confusion[1,2]/(confusion[1,1]+confusion[1,2])
class2_error=confusion[2,1]/(confusion[2,2]+confusion[2,1])
Next we will prepare each useful statistic for writing to an output file
sens_out=paste("sensitivity=",sensitivity, sep="")
spec_out=paste("specificity=",specificity, sep="")
err_out=paste("overall error rate=",overall_error,sep="")
acc_out=paste("overall accuracy=",overall_accuracy,sep="")
misclass_1=paste(confusion[1,2], colnames(confusion)[1],"misclassified as", colnames(confusion)[2], sep=" ")
misclass_2=paste(confusion[2,1], colnames(confusion)[2],"misclassified as", colnames(confusion)[1], sep=" ")
confusion_out=confusion[1:2,1:2]
confusion_out=cbind(rownames(confusion_out), confusion_out)
Create variables for the known target class and predicted class probabilities.
target=clindata_plusRF[,"event.rfs"]
target[target==1]="Relapse"
target[target==0]="NoRelapse"
relapse_scores=clindata_plusRF[,"Relapse"]
Once again, we will create an ROC curve and calculate the area under it (AUC). Recall that we use the relapse/non-relapse vote fractions as predictive variable. The ROC curve is generated by stepping through different thresholds for calling relapse vs non-relapse.
First calculate the AUC value.
pred=prediction(relapse_scores,target)
perf_AUC=performance(pred,"auc")
AUC=perf_AUC@y.values[[1]]
AUC_out=paste("AUC=",AUC,sep="")
Print the results to file along with other stats prepared above.
write("confusion table", file=outfile)
write.table(confusion_out,file=outfile, sep="\t", quote=FALSE, col.names=TRUE, row.names=FALSE, append=TRUE)
write(c(sens_out,spec_out,acc_out,err_out,misclass_1,misclass_2,AUC_out), file=outfile, append=TRUE)
Then, plot the actual ROC curve.
perf_ROC=performance(pred,"tpr","fpr")
pdf(file=ROC_pdffile)
plot(perf_ROC, main="ROC plot")
text(0.5,0.5,paste("AUC = ",format(AUC, digits=5, scientific=FALSE)))
dev.off()
You should now have case prediction file with which we will perform some survival analysis. See: Machine learning for cancer classification - part 4 - Plotting a Kaplan-Meier Curve for Survival Analysis.
I would be interested to look at a pythonic version of the tutorial utilizing scikits-learn or other library.
Me too. Are you volunteering to create one?
Tutorials are written by experts in the field like you. I tried machine learning a few times on datasets which were too biased or short and the features chosen did not desribe them properly. So I did not publish any paper on it.
Dear Obi
I am using your tutorial on my test data and Part 2 worked fabulously on my training data set and I thank you wholeheartedly for that, but now I have been stuck for several days. My code gets stuck at the prediction stage where I will be generating my RF predictions. This is the code I have received on my dataset
I am not sure what is going on. I wrote the prediction_names and predictor_data to a csv file but I am still very much confused. I am not sure why it is not finding SubjectTimePoint, it is there in my predictor_data. I have SubjectTimePoint (factor), Relapse(factor), Window(a factor which tells what month the timepoint is in) and then several genera. I want to predict relapse or no relapse from my training data set on this new test dataset.
I tried looking up the error in StackOverflow but I am afraid I still do not understand what is going on.
I am new to R and I am just trying to learn, so please accept my apologies if this may come off awkward or off base. But if you can help me, I would so very grateful
Respectfully yours
Vijay Antharam
(mst3000jay@gmail.com)
Buenas noches, soy nuevo programando en R, alguien podría por favor decirme cómo solucionar este error? Good evening, I am new to programming in R, someone could tell me how to solve this error?
Good evening, I'm new programming in R GSE2034_clindata=getGSEDataTables("GSE2034")[[2]][1:286,]// run this error : Unknown IO error // and answer this I/O warning : failed to load external entity "http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?targ=self&form=xml&view=full&acc=GSE2034" Error in UseMethod("xmlRoot") : no applicable method for 'xmlRoot' applied to an object of class "NULL"
Similar to the previous comments, when I run the chunk;
I get the
"Error in predictor_data[, RF_predictor_names] : subscript out of bounds"
and the subsequent code:
gives
Did anybody meet and solve these problems?