Question

Tutorial:Machine Learning For Cancer Classification - Part 4 - Plotting A Kaplan-Meier Curve For Survival Analysis

52

Entering edit mode

11.2 years ago

Nicholas Spies ★ 1.2k

This tutorial is part of a series illustrating basic concepts and techniques for machine learning in R. We will try to build a classifier of relapse in breast cancer. The analysis plan will follow the general pattern (simplified) of a past publication.

This follows from: Machine learning for cancer classification - part 1 - preparing the data sets, Machine learning for cancer classification - part 2 - Building a Random Forest Classifier and Machine learning for cancer classification - part 3 - Predicting with a Random Forest Classifier. For this tutorial, I will be referring to the test data set used in part 3 (GSE2990). Here we will take case predictions from the Random Forest Classifier, divide into low, intermediate and high risk groups and then perform survival analysis to determine whether these groups predict long-term outcome. The full script can be downloaded here.

Install the necessary packages (if not already installed) and load them.

install.packages("survival")
library(survival)

Set a directory and output files. Use the directory where you saved the testset_CasePredictions.txt file from Part 4.

datadir="/Users/obigriffith/git/biostar-tutorials/MachineLearning"
setwd(datadir)
case_pred_outfile="testset_CasePredictions.txt"
KMplotfile="KaplanMeier_TestSet_RFRS.pdf"

Import the results of the Random Forest classifier on the test data set

clindata_plusRF=read.table(case_pred_outfile, header = TRUE, na.strings = "NA", sep="\t")

Next, we will divide our dataset into quantiles based on the relapse risk predicted by random forests. In this example we use three groups, low, medium and high, separated evenly into thirds.

quantiles=quantile(clindata_plusRF[,"Relapse"], probs=c(0.33333,0.66667))
clindata_plusRF[,"RF_Group2"]=clindata_plusRF[,"Relapse"]
clindata_plusRF[which(clindata_plusRF[,"Relapse"]<=quantiles[1]),"RF_Group2"]="low"
clindata_plusRF[which(clindata_plusRF[,"Relapse"]>quantiles[1] &  clindata_plusRF[,"Relapse"]<=quantiles[2]),"RF_Group2"]="int"
clindata_plusRF[which(clindata_plusRF[,"Relapse"]>quantiles[2]),"RF_Group2"]="high"

Rename the time column in our data to make it easier to read.

clindata_plusRF[,"t_rfs"]=clindata_plusRF[,"time.rfs"]

Next, we add the event column. In our case we screened out any data points which lie outside of 10 years after the starting point, as seen by the second line of code here.

clindata_plusRF[,"e_rfs_10yrcens"]=clindata_plusRF[,"event.rfs"]
clindata_plusRF[which(clindata_plusRF[,"t_rfs"]>10),"e_rfs_10yrcens"]=0

At this point, clindata_plusRF contains quite a bit of information we won't use at the moment, so it helps to create a streamlined dataframe with only the pertinent information.

surv_data=clindata_plusRF[,c("t_rfs","e_rfs_10yrcens","RF_Group2")]

Take that data and create a survival object using the Surv() function.

surv_data.surv = with(surv_data, Surv(t_rfs, e_rfs_10yrcens==1))

Calculate a p-value across the three groups and format it to three significant digits.

survdifftest=survdiff(surv_data.surv ~ RF_Group2, data = surv_data)
survpvalue = 1 - pchisq(survdifftest$chisq, length(survdifftest$n) - 1)
survpvalue = format(as.numeric(survpvalue), digits=3)

The next code block creates a linear test p-value, using each of our risk groups as ordinal variables (low -> int -> high).

surv_data_lin=clindata_plusRF[,c("t_rfs","e_rfs_10yrcens","RF_Group2")]
surv_data_lin[,"RF_Group2"]=factor(surv_data_lin[,"RF_Group2"],ordered = TRUE,levels=c("low","int","high"))
survpvalue_linear=summary(coxph(Surv(t_rfs, e_rfs_10yrcens)~RF_Group2, data=surv_data_lin))$sctest[3]
survpvalue_linear = format(as.numeric(survpvalue_linear), digits=3)

Finally, we plot our Kaplan-Meier curve using our survival object. The following code will give you a KM plot.

krfit.by_RFgroup = survfit(surv_data.surv ~ RF_Group2, data = surv_data)
pdf(file=KMplotfile)
colors = rainbow(5)
title="Survival by RFRS - Test Set"
plot(krfit.by_RFgroup, col = colors, xlab = "Time (Years)", ylab = "Relapse Free Survival", main=title, cex.axis=1.3, cex.lab=1.4)
abline(v = 10, col = "black", lty = 3)

The final block of code will plot a legend to help with interpretation of the figure

groups=sort(unique(surv_data[,"RF_Group2"])) #returns unique factor levels sorted alphabetically
names(colors)=groups
groups_custom=c("low","int","high")
colors_custom=colors[groups_custom]
group_sizes_custom=table(surv_data[,"RF_Group2"])[groups_custom]
groups_custom=c("Low","Intermediate","High") #Reset names for consistency with manuscript
legend_text=c(paste(groups_custom, " ", "(", group_sizes_custom, ")", sep=""),paste("p =", survpvalue_linear, sep=" "))
legend(x = "bottomleft", legend = legend_text, col = c(colors_custom,"white"), lty = "solid", bty="n", cex=1.2)
dev.off()

As you can see in the KM plot, patients predicted by the random forests classifier to have low probability of relapse have much better relapse-free survival outcomes than patients predicted to have high probability of relapse. This is of course expected given that we trained the random forest classifier on relapse status. However, it illustrates a practical application of the machine learning exercise completed in tutorials 1 to 3. In theory, using the Random Forest probabilities, we could define a cut-off for a low-risk group that is sufficiently unlikely to have a relapse over the next 10 years to alter their treatment from aggressive chemo to watch-and-wait.

Survival classification r cancer Machine-learning • 15k views

ADD COMMENT • link updated 6 months ago by Obi Griffith 20k • written 11.2 years ago by Nicholas Spies ★ 1.2k

3

Entering edit mode

I know it's super common to use Kaplan Meier Curves, but I want to suggest also considering cumulative incidence plots. In some cancers, especially prostate cancer, patients are likely to die of other causes. Using standard survival analysis, patients that drop out of the study and patients that die due to other causes are considered the same. But this isn't accurate as a patient who dropped out of the study can still develop the event and someone who died being hit by a bus cannot.

Cumulative incidence, accounting for competing risks, allows you to further segregate the censored group such that those who dropped out would be different that those that died due to other causes. It gives you a better estimate of the incidence (where 1 - incidence = probability of survival). It can be done in R using cmprsk library. Check it out - and consider it as an alternative to KM survival analysis.

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 11.2 years ago by amcrisan ▴ 370

0

Entering edit mode

Thanks for this tip!

ADD REPLY • link 11.2 years ago by Obi Griffith 20k

1

Entering edit mode

In this code

surv_data_lin[,"RF_Group2"]=as.numeric(surv_data_lin[,"RF_Group2"])

should it be as.factor rather then as.numeric?

surv_data_lin[,"RF_Group2"]=as.factor(surv_data_lin[,"RF_Group2"])

By reading this "we will take case predictions ... divide into low, intermediate and high risk groups ..." I think RF_Group2 is categorical variable.

Using as.numeric and as.factor in coxph give different results.

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.5 years ago by Sheila ▴ 280

0

Entering edit mode

RF_Group2 is categorical but we want to treat it as an ordinal categorical variable (ordered from low to medium to high) rather than a nominal categorical variable (where group order is meaningless). This is why we recoded the categories from "low", "int", "high" to 1, 2, 3 and treated them as numerical rather than factor which by default is not ordered in R. I was following an example I found elsewhere for how to do it (that source no longer exists). However, I think you are right that it would be more correct to convert these categories to an ordered factor. The code has been updated to reflect this. It did have a small effect on the p-value.

ADD REPLY • link 6 months ago by Obi Griffith 20k

Ram · Answer 1 · 2015-08-27

Thank you very much for this wonderful tutorials.

I took the liberty of refactoring this script to dplyr, which should be more concise and easier to read.

For example the following code:

#read RF results and clinical data from file
clindata_plusRF=read.table(case_pred_outfile, header = TRUE, na.strings = "NA", sep="\t")

#Create new risk grouping with additional groups
#Add column for new grouping
quantiles=quantile(clindata_plusRF[,"Relapse"], probs=c(0.33333,0.66667))
clindata_plusRF[,"RF_Group2"]=clindata_plusRF[,"Relapse"]
clindata_plusRF[which(clindata_plusRF[,"Relapse"]<=quantiles[1]),"RF_Group2"]="low"
clindata_plusRF[which(clindata_plusRF[,"Relapse"]>quantiles[1] &  clindata_plusRF[,"Relapse"]<=quantiles[2]),"RF_Group2"]="int"
clindata_plusRF[which(clindata_plusRF[,"Relapse"]>quantiles[2]),"RF_Group2"]="high"

#Rename time column for easy scripting
clindata_plusRF[,"t_rfs"]=clindata_plusRF[,"time.rfs"]

#Add column of 10yr censored data
clindata_plusRF[,"e_rfs_10yrcens"]=clindata_plusRF[,"event.rfs"]
clindata_plusRF[which(clindata_plusRF[,"t_rfs"]>10),"e_rfs_10yrcens"]=0

Gets reduced to:

#Create new risk grouping with additional groups & 10yr censored data
clindata_plusRF = clindata_plusRF %>% 
    rename(t_rfs = time.rfs) %>%          # Rename time column for easy scripting
    mutate(
        RF_Group2 = c('low', 'int', 'high')[ntile(Relapse, 3)], # Add column for new grouping
        e_rfs_10yrcens = ifelse (t_rfs>10, 0, event.rfs)        # Add column of 10yr censored data
        )

I've sent you a pull request on github, with an alternative version of the script.

score 0 · Answer 2 · 2017-06-13

0

Entering edit mode

7.7 years ago

jmzeng1314 ▴ 140

In the study "GSE2990", the sample size is 189, and you've filtered some samples which didn't have survival information.

Why do you still can plot 62+63+62 samples in you KM plot ?

ADD COMMENT • link 7.7 years ago by jmzeng1314 ▴ 140

0

Entering edit mode

The GSE2990 study did include 189 samples. But, two of these were "NA" for event.rfs in the testset_clindetails.txt file and were therefore dropped in part 3. That left 187 samples (62 + 63 + 62 = 187).

ADD REPLY • link 6 months ago by Obi Griffith 20k