Question

R survival analysis

3

Entering edit mode

10.5 years ago

hdy ▴ 180

I am now learning how to do survival analysis in R and using COX proportional hazards model, which can be referred to the 'coxph' function under package 'survival'. When I looked some online examples, eg, here, I found sometimes the code looks like

Surv(time, censor) ~ factor(hercoc)

But I can also do

Surv(time, censor) ~ hercoc

I was wondering what is the difference between these two and when shall I use which one. And how should I interpret the result from the one used 'factor'

R survival • 16k views

ADD COMMENT • link updated 3.2 years ago by Ram 44k • written 10.5 years ago by hdy ▴ 180

0

Entering edit mode

In R the factor data format should be used for categorical data. For example, if you were doing survival analysis for three different treatments

treatments<- c(1,2,3)

Then you should pass this vector as a factor because the data are categorical. If you did not do this then R would assume the data are continuous and might cause misinterpretations of the results.

On the other hand, if the treatment was of one drug but at different concentrations such as

treatment<- c(0,1,1.5,2)

Then you should not factor these data because they are continuous.

At least that's my understanding, others please chime in

ADD REPLY • link updated 3.2 years ago by Ram 44k • written 10.5 years ago by QVINTVS_FABIVS_MAXIMVS ★ 2.6k

Ram · Accepted Answer · 2014-08-12

If your variable hercoc has only two levels then there is no difference. However, if it has 3 or more levels then there is a difference. You haven't provided any example data, and I am assuming that hercoc is numeric.

Using a more concrete example:

library(survival)

attach(lung)

head(lung)

#     inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
#1      3  306      2  74   1       1       90       100     1175      NA
#2      3  455      2  68   1       0       90        90     1225      15
#3      3 1010      1  56   1       0       90        90       NA      15
#4      5  210      2  57   1       1       90        60     1150      11

The variable sex has two levels but is coded as a numeric variable here (1 for male, 2 for female or vice versa).

summary(as.factor(sex))
  1   2 
138  90

There's only one coefficient fitted by the model regardless of how you write it:

coxph(Surv(time, status) ~ sex)

# Call:
coxph(formula = Surv(time, status) ~ sex)

      coef exp(coef) se(coef)     z      p
sex -0.531     0.588    0.167 -3.18 0.0015

Likelihood ratio test=10.6  on 1 df, p=0.00111  n= 228, number of events= 165 

coxph(Surv(time, status) ~ as.factor(sex))
Call:
coxph(formula = Surv(time, status) ~ as.factor(sex))

                  coef exp(coef) se(coef)     z      p
as.factor(sex)2 -0.531     0.588    0.167 -3.18 0.0015

Likelihood ratio test=10.6  on 1 df, p=0.00111  n= 228, number of events= 165

For the variable ph.ecog there are 4 levels

summary(as.factor(ph.ecog))
#   0    1    2    3 NA's 
#  63  113   50    1    1

On fitting the survival model against ph.ecog it really does make a difference whether the variable enters as a numeric or a factor. If treated numerically, only a single coefficient is fitted (for a given individual, the value for ecog is multiplied by this coefficient before entering into the coxph calculation);

coxph(Surv(time, status) ~ ph.ecog)
Call:
coxph(formula = Surv(time, status) ~ ph.ecog)

         coef exp(coef) se(coef)   z       p
ph.ecog 0.476      1.61    0.113 4.2 2.7e-05

Likelihood ratio test=17.6  on 1 df, p=2.77e-05  n= 227, number of events= 164 
   (1 observation deleted due to missingness)

However, treated as a factor, three different coefficients will be fitted, one for each non-reference level (ie, levels 1 2 and 3 each have a coefficient) and for a given individual you would look up the coefficient corresponding to the level of the ecog factor.

> coxph(Surv(time, status) ~ as.factor(ph.ecog))
Call:
coxph(formula = Surv(time, status) ~ as.factor(ph.ecog))

                     coef exp(coef) se(coef)    z       p
as.factor(ph.ecog)1 0.369      1.45    0.199 1.86 6.3e-02
as.factor(ph.ecog)2 0.916      2.50    0.225 4.08 4.5e-05
as.factor(ph.ecog)3 2.208      9.10    1.026 2.15 3.1e-02

Likelihood ratio test=18.4  on 3 df, p=0.000356  n= 227, number of events= 164 
   (1 observation deleted due to missingness)

Look into how the coefficients enter the survival model in a good Generalised linear model book (I really can't explain that quickly for you)