After our comments, I am just posting a fully reproducible example of how anyone can perform a Cox regression analysis with Lasso via glmnet, here using gene expression data. I post this due to the fact that your question title is likely to attract many hits from search engines, and therefore requires a reasonable answer, not a commentary.
You should also check the vignette for Cox models with lasso: https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html#cox
Following my example from here to get some data: Survival analysis via Cox Proportional Hazards regression
Download and set-up data
library(Biobase)
library(GEOquery)
# load series and platform data from GEO
gset <- getGEO('GSE2990', GSEMatrix =TRUE, getGPL=FALSE)
x <- exprs(gset[[1]])
# remove Affymetrix control probes
x <- x[-grep('^AFFX', rownames(x)),]
# transform the expression data to Z scores
x <- t(scale(t(x)))
# extract information of interest from the phenotype data (pdata)
idx <- which(colnames(pData(gset[[1]])) %in%
c('age:ch1', 'distant rfs:ch1', 'er:ch1',
'ggi:ch1', 'grade:ch1', 'node:ch1',
'size:ch1', 'time rfs:ch1'))
metadata <- data.frame(pData(gset[[1]])[,idx],
row.names = rownames(pData(gset[[1]])))
# remove samples from the pdata that have any NA value
discard <- apply(metadata, 1, function(x) anyis.na(x)))
metadata <- metadata[!discard,]
# filter the Z-scores expression data to match the samples in our pdata
x <- x[,which(colnames(x) %in% rownames(metadata))]
# check that sample names match exactly between pdata and Z-scores
all((colnames(x) == rownames(metadata)) == TRUE)
# create a merged pdata and Z-scores object
coxdata <- data.frame(metadata, t(x))
# tidy column names
colnames(coxdata)[1:8] <- c('Age', 'Distant.RFS', 'ER',
'GGI', 'Grade', 'Node',
'Size', 'Time.RFS')
# prepare certain phenotypes
coxdata$Age <- as.numeric(gsub('^KJ', '', coxdata$Age))
coxdata$Distant.RFS <- as.numeric(coxdata$Distant.RFS)
coxdata$ER <- factor(coxdata$ER, levels = c(0, 1))
coxdata$Grade <- factor(coxdata$Grade, levels = c(1, 2, 3))
coxdata$Time.RFS <- as.numeric(gsub('^KJX|^KJ', '', coxdata$Time.RFS))
Now perform the Cox survival analysis with lasso:
require(glmnet)
require(survival)
coxdata[1:10,1:15]
Age Distant.RFS ER GGI Grade Node Size Time.RFS X1007_s_at
GSM65752 40 0 0 2.48005 3 0 1.2 2280 0.61091071
GSM65753 46 0 1 -0.633592 1 0 1.3 2675 0.66320599
GSM65754 67 1 1 -1.02972 1 0 6 426 0.39016262
GSM65755 41 1 1 1.04395 3 0 3.3 182 -0.01019652
GSM65756 38 1 1 0.683559 3 0 3.2 46 0.71052390
GSM65757 34 0 1 1.05919 2 0 1.6 3952 0.92185361
GSM65758 46 1 1 -1.23306 2 0 2.1 1824 0.01764556
GSM65760 57 1 1 0.679034 3 0 2.2 699 0.23028475
GSM65761 63 1 1 0.31585 2 0 2.8 730 1.19710288
GSM65762 54 1 1 -1.17846 2 0 1.7 882 0.70807685
X1053_at X117_at X121_at X1255_g_at X1294_at X1316_at
GSM65752 0.80321777 0.9576007 0.3236762 0.20730041 -0.57647252 0.5170007
GSM65753 0.12525380 0.4033209 0.1810776 0.08599057 0.43754667 0.5622666
GSM65754 -0.24124193 0.4739672 0.4422982 0.24498162 0.01942667 0.8180497
GSM65755 0.78790284 0.3364908 0.2314423 0.04583503 -0.62393026 0.3874972
GSM65756 0.28652687 0.4308167 0.3995143 0.36486763 -0.71256876 0.9834658
GSM65757 0.35280009 0.5448720 0.2622013 -0.03252980 0.08691080 0.5999085
GSM65758 0.11911061 0.6634477 0.2808262 -0.02828385 0.49225720 0.7275434
GSM65760 -0.07668636 0.3542536 0.4505841 0.33018455 0.63742514 0.6240488
GSM65761 0.31422159 0.3428862 0.3198296 0.17794074 1.01971818 0.7655406
GSM65762 -0.19605026 0.4011981 0.2035890 0.08022851 0.14603593 0.4257209
coxlasso <- glmnet(
x = data.matrix(coxdata[,10:ncol(coxdata)]),
y = Surv(coxdata$Time.RFS, coxdata$Distant.RFS),
family = 'cox')
Kevin
You can have
days
/time to death
as the outcome variable, in which case the model becomes a linear regression. If you havevital status
as the outcome, then it becomes a binary logistic regression model, as people can only either be alive or dead. Researchers do not typically use survival data in this way.To be pedantic about it: models used in this way are not 'predicting' anything. If you set-up your model correctly, then
time to death
andvital status
are taken into account when calculating the hazard ratio. Hazard ratio for gene expression obviously has a different interpretation than a hazard ratio for a binary categorical variable.Usually the log rank p-value is taken and quoted in manuscripts.
For a better answer on the statistics behind all of this, please post your question at CrossValidated.
So, you are using TCGA data. In order for me to answer this part, you will have to show the code that you have used.
Hi @Kevin, Thank you for your reply and a nice explanation. Yes I am using TCGA data and need some more clarification for your comments.
What do you suggest then, how should I use the survival data of patients to predict their survival?
In this statement you are referring to which manuscript.
Thanks
If you need help in setting up the actual Survival model for use with glmnet Cox, then take a look here: Exploring association between genes by their expression
Specifically, take a look at my comment, where I provide a fully reproducible example to perform a lasso Cox regression: C: Exploring association between genes by their expression
It could just be the way that you are describing it, but it seems like you simply want a model like this:
I was just implying that we typically have something like this:
I have an entire tutorial, here: Survival analysis with gene expression
----------------------------------------
It was just a general observation from having worked with many physicians and having read manuscripts. I also just found This, which states: "The standard method, the log-rank test, was used for statistical comparison of survival times."