Question

Predicting survival of cancer patients using glmnet cox

4

Entering edit mode

5.6 years ago

Researcher ▴ 130

I want to use glmnet cox regression approach to predict survival from methylation data for cancer patients. But I couldn't find any descriptive reference from the authors except this one https://cran.r-project.org/web/packages/glmnet/vignettes/Coxnet.pdf, specially discussing its prediction values.

I am looking for answers to some of my basic questions.

can we predict the survival time (number of days for which patient will survive after diagnosis) and vital status of a cancer patient from its gene expression or methylation data using glmnet cox regression? If not: a) What does it actually predict then? If it is hazard ratio, then what is its statistical significance? b) How can I use these predicted values to infer the survival of these patient. So that I can compare these predictions with the actual survival information of the patients to test the accuracy of my model.
I have also used a real patient data and divided it into training (80%) and testing (20%). For training my model, I used the "overall survival" as well as "vital status of the patient" from the training set . Later on, used test data for prediction, where it is giving some negative values. But I have no clue what are these values and what does this negative sign mean, if it has some?

Please accept my apology for my short knowledge in statistics. Any help and suggestions are welcome.

Thanks in advance

R survival glmnet cox cancer • 9.8k views

ADD COMMENT • link updated 2.1 years ago by Kevin Blighe 88k • written 5.6 years ago by Researcher ▴ 130

0

Entering edit mode

can we predict the survival time (number of days for which patient will survive after diagnosis) and vital status of a cancer patient from its gene expression or methylation data using glmnet cox regression?

You can have days / time to death as the outcome variable, in which case the model becomes a linear regression. If you have vital status as the outcome, then it becomes a binary logistic regression model, as people can only either be alive or dead. Researchers do not typically use survival data in this way.

a) What does it actually predict then? If it is hazard ratio, then what is its statistical significance? b) How can I use these predicted values to infer the survival of these patient. So that I can compare these predictions with the actual survival information of the patients to test the accuracy of my model.

To be pedantic about it: models used in this way are not 'predicting' anything. If you set-up your model correctly, then time to death and vital status are taken into account when calculating the hazard ratio. Hazard ratio for gene expression obviously has a different interpretation than a hazard ratio for a binary categorical variable.

Usually the log rank p-value is taken and quoted in manuscripts.

For a better answer on the statistics behind all of this, please post your question at CrossValidated.

I have also used a real patient data and divided it into training (80%) and testing (20%). For training my model, I used the "overall survival" as well as "vital status of the patient" from the training set . Later on, used test data for prediction, where it is giving some negative values. But I have no clue what are these values and what does this negative sign mean, if it has some?

So, you are using TCGA data. In order for me to answer this part, you will have to show the code that you have used.

ADD REPLY • link 5.6 years ago by Kevin Blighe 88k

0

Entering edit mode

Hi @Kevin, Thank you for your reply and a nice explanation. Yes I am using TCGA data and need some more clarification for your comments.

Researchers do not typically use survival data in this way.

What do you suggest then, how should I use the survival data of patients to predict their survival?

Usually the log rank p-value is taken and quoted in manuscripts.

In this statement you are referring to which manuscript.

Thanks

ADD REPLY • link 5.6 years ago by Researcher ▴ 130

2

Entering edit mode

If you need help in setting up the actual Survival model for use with glmnet Cox, then take a look here: Exploring association between genes by their expression

Specifically, take a look at my comment, where I provide a fully reproducible example to perform a lasso Cox regression: C: Exploring association between genes by their expression

ADD REPLY • link 5.6 years ago by Kevin Blighe 88k

0

Entering edit mode

What do you suggest then, how should I use the survival data of patients to predict their survival?

It could just be the way that you are describing it, but it seems like you simply want a model like this:

VitalStatus ~ gene

I was just implying that we typically have something like this:

Surv(time, VitalStatus) ~ gene

I have an entire tutorial, here: Survival analysis with gene expression

----------------------------------------

In this statement you are referring to which manuscript.

It was just a general observation from having worked with many physicians and having read manuscripts. I also just found This, which states: "The standard method, the log-rank test, was used for statistical comparison of survival times."

ADD REPLY • link 5.6 years ago by Kevin Blighe 88k

score 5 · Answer 1 · 2019-06-09

After our comments, I am just posting a fully reproducible example of how anyone can perform a Cox regression analysis with Lasso via glmnet, here using gene expression data. I post this due to the fact that your question title is likely to attract many hits from search engines, and therefore requires a reasonable answer, not a commentary.

You should also check the vignette for Cox models with lasso: https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html#cox

Following my example from here to get some data: Survival analysis via Cox Proportional Hazards regression

Download and set-up data

library(Biobase)
library(GEOquery)

# load series and platform data from GEO
gset <- getGEO('GSE2990', GSEMatrix =TRUE, getGPL=FALSE)
x <- exprs(gset[[1]])

# remove Affymetrix control probes
x <- x[-grep('^AFFX', rownames(x)),]

# transform the expression data to Z scores
x <- t(scale(t(x)))

# extract information of interest from the phenotype data (pdata)
idx <- which(colnames(pData(gset[[1]])) %in%
  c('age:ch1', 'distant rfs:ch1', 'er:ch1',
    'ggi:ch1', 'grade:ch1', 'node:ch1',
    'size:ch1', 'time rfs:ch1'))
metadata <- data.frame(pData(gset[[1]])[,idx],
  row.names = rownames(pData(gset[[1]])))

# remove samples from the pdata that have any NA value
discard <- apply(metadata, 1, function(x) anyis.na(x)))
metadata <- metadata[!discard,]

# filter the Z-scores expression data to match the samples in our pdata
x <- x[,which(colnames(x) %in% rownames(metadata))]

# check that sample names match exactly between pdata and Z-scores 
all((colnames(x) == rownames(metadata)) == TRUE)

# create a merged pdata and Z-scores object
coxdata <- data.frame(metadata, t(x))

# tidy column names
colnames(coxdata)[1:8] <- c('Age', 'Distant.RFS', 'ER',
  'GGI', 'Grade', 'Node',
  'Size', 'Time.RFS')

# prepare certain phenotypes
coxdata$Age <- as.numeric(gsub('^KJ', '', coxdata$Age))
coxdata$Distant.RFS <- as.numeric(coxdata$Distant.RFS)
coxdata$ER <- factor(coxdata$ER, levels = c(0, 1))
coxdata$Grade <- factor(coxdata$Grade, levels = c(1, 2, 3))
coxdata$Time.RFS <- as.numeric(gsub('^KJX|^KJ', '', coxdata$Time.RFS))

Now perform the Cox survival analysis with lasso:

require(glmnet)
require(survival)

coxdata[1:10,1:15]
         Age Distant.RFS ER       GGI Grade Node Size Time.RFS  X1007_s_at
GSM65752  40           0  0   2.48005     3    0  1.2     2280  0.61091071
GSM65753  46           0  1 -0.633592     1    0  1.3     2675  0.66320599
GSM65754  67           1  1  -1.02972     1    0    6      426  0.39016262
GSM65755  41           1  1   1.04395     3    0  3.3      182 -0.01019652
GSM65756  38           1  1  0.683559     3    0  3.2       46  0.71052390
GSM65757  34           0  1   1.05919     2    0  1.6     3952  0.92185361
GSM65758  46           1  1  -1.23306     2    0  2.1     1824  0.01764556
GSM65760  57           1  1  0.679034     3    0  2.2      699  0.23028475
GSM65761  63           1  1   0.31585     2    0  2.8      730  1.19710288
GSM65762  54           1  1  -1.17846     2    0  1.7      882  0.70807685
            X1053_at   X117_at   X121_at  X1255_g_at    X1294_at  X1316_at
GSM65752  0.80321777 0.9576007 0.3236762  0.20730041 -0.57647252 0.5170007
GSM65753  0.12525380 0.4033209 0.1810776  0.08599057  0.43754667 0.5622666
GSM65754 -0.24124193 0.4739672 0.4422982  0.24498162  0.01942667 0.8180497
GSM65755  0.78790284 0.3364908 0.2314423  0.04583503 -0.62393026 0.3874972
GSM65756  0.28652687 0.4308167 0.3995143  0.36486763 -0.71256876 0.9834658
GSM65757  0.35280009 0.5448720 0.2622013 -0.03252980  0.08691080 0.5999085
GSM65758  0.11911061 0.6634477 0.2808262 -0.02828385  0.49225720 0.7275434
GSM65760 -0.07668636 0.3542536 0.4505841  0.33018455  0.63742514 0.6240488
GSM65761  0.31422159 0.3428862 0.3198296  0.17794074  1.01971818 0.7655406
GSM65762 -0.19605026 0.4011981 0.2035890  0.08022851  0.14603593 0.4257209


coxlasso <- glmnet(
  x = data.matrix(coxdata[,10:ncol(coxdata)]),
  y = Surv(coxdata$Time.RFS, coxdata$Distant.RFS),
  family = 'cox')

Kevin