Hi,
I am trying to run a survival analysis for a gene using TCGA data. I did this two ways: with the continuous expression value and with discrete values (low and high, split at the median expression). The p-values differ hugely between the two. Can anyone tell me which approach is better for survival analysis?
My command:
coxph(Surv(time, status) ~ expression, data = survdata)
Results:
HR = 0.82, logrank P = 0.02 (discrete model)
HR = 0.87, logrank P = 0.00001 (continuous model)
Thanks
Thanks Kevin. The expression data are log2 RSEM values, and this is the distribution:
https://ibb.co/pbWBV0g
The median expression value of this gene is 8.73 across 452 samples.
What if you convert that logged data to Z-scores and then trichotomise it based on that?
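A minimal sketch of that suggestion, in case it helps. The data here are simulated stand-ins (not TCGA), the column names follow the `survdata` used in the original command, and the ±1 Z-score cut-offs are only an assumption for illustration; tertiles or other thresholds would be equally defensible.

```r
library(survival)

# Simulated stand-in for the real data (hypothetical values, not TCGA)
set.seed(1)
n <- 452
survdata <- data.frame(
  expression = rnorm(n, mean = 8.73, sd = 1),  # log2 RSEM-like values
  time       = rexp(n, rate = 0.1),            # follow-up time
  status     = rbinom(n, 1, 0.6)               # 1 = event, 0 = censored
)

# Convert the logged expression to Z-scores
survdata$z <- as.numeric(scale(survdata$expression))

# Trichotomise on the Z-score; the -1 / +1 cut-offs are an arbitrary choice
survdata$group <- cut(survdata$z,
                      breaks = c(-Inf, -1, 1, Inf),
                      labels = c("Low", "Mid", "High"))

# How many samples fall in each group
table(survdata$group)

# Cox model on the three-level factor ("Low" is the reference level)
fit_tri <- coxph(Surv(time, status) ~ group, data = survdata)
summary(fit_tri)
```

Note that Z-scoring is a linear rescaling, so on its own it changes the hazard ratio per unit but not the p-value of the continuous model; only the grouping step changes what is being tested.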
Nearly the same results using the Z-score data for the discrete (logrank P = 0.02) and continuous (logrank P = 0.00004) models.
You should check the hazard ratios too, and their confidence intervals. If, in one situation, the hazard ratio is 0.6 but the upper 95% limit passes 1.0, then that is not as reliable as a situation where the upper 95% limit is 0.8. The same is true in reverse: a hazard ratio of 2.9 whose lower 95% limit falls below 1.0 is less reliable than one whose lower limit stays above 1.0.
That is: check that the hazard ratio limits don't cross the 'barrier' of 1.0. It's just a simple extra check.
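As a rough sketch of that check in R: `summary()` on a `coxph` fit returns a `conf.int` matrix holding the hazard ratio and its 95% limits. The data below are simulated placeholders (not the poster's TCGA data), and `crosses_one` is just a name made up for this example.

```r
library(survival)

# Simulated stand-in for the real data (hypothetical values, not TCGA)
set.seed(1)
n <- 452
survdata <- data.frame(
  expression = rnorm(n, mean = 8.73, sd = 1),
  time       = rexp(n, rate = 0.1),
  status     = rbinom(n, 1, 0.6)
)

fit <- coxph(Surv(time, status) ~ expression, data = survdata)

# conf.int columns: exp(coef), exp(-coef), lower .95, upper .95
ci    <- summary(fit)$conf.int
hr    <- ci[, "exp(coef)"]
lower <- ci[, "lower .95"]
upper <- ci[, "upper .95"]

# TRUE if the 95% interval spans 1.0, i.e. the HR is not clearly
# protective (< 1) or harmful (> 1)
crosses_one <- lower < 1 & upper > 1

print(data.frame(HR = hr, lower95 = lower, upper95 = upper,
                 crosses_one = crosses_one))
```

Running the same check on both the continuous and the discrete model lets you compare how tight each interval is, not only whether it excludes 1.0.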
Thanks again. Yes, there is a difference in the HRs and their confidence intervals (upper/lower 95%).
Looking at that, I'd assume that continuous was more reliable. I think that it's okay to derive the p-value and HRs from the continuous variable and then just plot dichotomised variables in the survival plot. You just have to clearly state what you have done in the methods.
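One way this could look in practice, as a sketch only: take the HR and p-value from the continuous Cox model, then split at the median purely for the Kaplan-Meier plot. The data are simulated placeholders, and the colours and labels are arbitrary choices for the example.

```r
library(survival)

# Simulated stand-in for the real data (hypothetical values, not TCGA)
set.seed(1)
n <- 452
survdata <- data.frame(
  expression = rnorm(n, mean = 8.73, sd = 1),
  time       = rexp(n, rate = 0.1),
  status     = rbinom(n, 1, 0.6)
)

# 1. Statistics from the continuous model
fit_cont <- coxph(Surv(time, status) ~ expression, data = survdata)
s        <- summary(fit_cont)
hr_cont  <- s$conf.int[, "exp(coef)"]   # HR per 1-unit increase in log2 expression
p_cont   <- s$sctest["pvalue"]          # score (logrank) test p-value

# 2. Median split used only for visualisation
survdata$group <- ifelse(survdata$expression > median(survdata$expression),
                         "High", "Low")
km <- survfit(Surv(time, status) ~ group, data = survdata)

# 3. Kaplan-Meier curves for the two groups, annotated with the continuous-model results
plot(km, col = c("red", "blue"), xlab = "Time", ylab = "Survival probability")
legend("topright", legend = names(km$strata), col = c("red", "blue"), lty = 1)
mtext(sprintf("Continuous Cox model: HR = %.2f, P = %.3g", hr_cont, p_cont))
```

If you do this, the figure legend or methods should say explicitly that the curves use a median split while the reported HR and p-value come from the continuous model.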
Thanks for your help, Kevin. I found a relevant article on this issue:
Comparing continuous and discrete analyses of breast cancer survival information
https://www.sciencedirect.com/science/article/pii/S0888754316300684
No problem.