If your variable hercoc
has only two levels then there is no difference. However, if it has 3 or more levels then there is a difference. You haven't provided any example data, and I am assuming that hercoc
is numeric.
Using a more concrete example:
library(survival)
attach(lung)
head(lung)
# inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
#1 3 306 2 74 1 1 90 100 1175 NA
#2 3 455 2 68 1 0 90 90 1225 15
#3 3 1010 1 56 1 0 90 90 NA 15
#4 5 210 2 57 1 1 90 60 1150 11
The variable sex
has two levels but is coded as a numeric variable here (1 for male, 2 for female or vice versa).
summary(as.factor(sex))
1 2
138 90
There's only one coefficient fitted by the model regardless of how you write it:
coxph(Surv(time, status) ~ sex)
# Call:
coxph(formula = Surv(time, status) ~ sex)
coef exp(coef) se(coef) z p
sex -0.531 0.588 0.167 -3.18 0.0015
Likelihood ratio test=10.6 on 1 df, p=0.00111 n= 228, number of events= 165
coxph(Surv(time, status) ~ as.factor(sex))
Call:
coxph(formula = Surv(time, status) ~ as.factor(sex))
coef exp(coef) se(coef) z p
as.factor(sex)2 -0.531 0.588 0.167 -3.18 0.0015
Likelihood ratio test=10.6 on 1 df, p=0.00111 n= 228, number of events= 165
For the variable ph.ecog
there are 4 levels
summary(as.factor(ph.ecog))
# 0 1 2 3 NA's
# 63 113 50 1 1
On fitting the survival model against ph.ecog
it really does make a difference whether the variable enters as a numeric or a factor. If treated numerically, only a single coefficient is fitted (for a given individual, the value for ecog is multiplied by this coefficient before entering into the coxph calculation);
coxph(Surv(time, status) ~ ph.ecog)
Call:
coxph(formula = Surv(time, status) ~ ph.ecog)
coef exp(coef) se(coef) z p
ph.ecog 0.476 1.61 0.113 4.2 2.7e-05
Likelihood ratio test=17.6 on 1 df, p=2.77e-05 n= 227, number of events= 164
(1 observation deleted due to missingness)
However, treated as a factor, three different coefficients will be fitted, one for each non-reference level (ie, levels 1 2 and 3 each have a coefficient) and for a given individual you would look up the coefficient corresponding to the level of the ecog
factor.
> coxph(Surv(time, status) ~ as.factor(ph.ecog))
Call:
coxph(formula = Surv(time, status) ~ as.factor(ph.ecog))
coef exp(coef) se(coef) z p
as.factor(ph.ecog)1 0.369 1.45 0.199 1.86 6.3e-02
as.factor(ph.ecog)2 0.916 2.50 0.225 4.08 4.5e-05
as.factor(ph.ecog)3 2.208 9.10 1.026 2.15 3.1e-02
Likelihood ratio test=18.4 on 3 df, p=0.000356 n= 227, number of events= 164
(1 observation deleted due to missingness)
Look into how the coefficients enter the survival model in a good Generalised linear model book (I really can't explain that quickly for you)
In R the factor data format should be used for categorical data. For example, if you were doing survival analysis for three different treatments
Then you should pass this vector as a factor because the data are categorical. If you did not do this then R would assume the data are continuous and might cause misinterpretations of the results.
On the other hand, if the treatment was of one drug but at different concentrations such as
Then you should not factor these data because they are continuous.
At least that's my understanding, others please chime in