Question

TCGA survival analysis: continuous vs discrete expression values

0

6.5 years ago

Mike ★ 1.9k

Hi,

I am trying to do a survival analysis for a gene using TCGA data. I did this both ways: with continuous expression values and with discrete values (Low and High, split at the median expression value). The p-values differ hugely between the two. Can anyone tell me which approach is better for survival analysis?

my command:

coxph(Surv(time, status) ~ expression, data = survdata)

results:

HR = 0.82, logrankP = 0.02     (discrete model)
HR = 0.87, logrankP = 0.00001  (continuous model)

Thanks

survival coxph Cox model • 3.2k views

ADD COMMENT • link 6.5 years ago by Mike ★ 1.9k

score 2 · Accepted Answer · 2019-02-14

2

6.5 years ago

Kevin Blighe 89k

When you convert the data to discrete values, you are eliminating information, as I elaborate here in an extreme example: A: Why quantitative design are preferred GWAS approach. In the process, you also make the result more readily interpretable to the human brain. Simply using Low and High may be too few categories; you could try introducing more.

If your data is on the continuous scale, you need to be aware of the distribution that it follows and whether you have processed it correctly.
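
To make the comparison concrete, here is a minimal sketch on simulated data (not TCGA). It fits the same gene as a continuous covariate, as a median split, and as tertiles. The data frame `survdata` and its columns mirror the names in the question; the simulation parameters and the `group` and `tertile` columns are assumptions made only for illustration. Note that the hazard ratios are not on the same scale: the continuous HR is per one-unit (log2) increase, while the discrete HR compares groups.

```r
library(survival)

set.seed(1)
n <- 452  # sample count quoted in the thread; all values below are simulated

# Hypothetical stand-in for the log2 expression of one gene
expression <- rnorm(n, mean = 8.7, sd = 1)

# Simulated event and censoring times; the hazard falls as expression rises
event_time <- rexp(n, rate = 0.10 * exp(-0.15 * (expression - 8.7)))
cens_time  <- rexp(n, rate = 0.05)
survdata <- data.frame(
  time       = pmin(event_time, cens_time),
  status     = as.integer(event_time <= cens_time),  # 1 = event, 0 = censored
  expression = expression
)

# Continuous model: HR is per one-unit (log2) increase in expression
fit_cont <- coxph(Surv(time, status) ~ expression, data = survdata)

# Median split: HR compares High against Low
survdata$group <- factor(
  ifelse(survdata$expression > median(survdata$expression), "High", "Low"),
  levels = c("Low", "High")
)
fit_disc <- coxph(Surv(time, status) ~ group, data = survdata)

# More categories: tertiles (Low is the reference level)
survdata$tertile <- cut(
  survdata$expression,
  breaks = quantile(survdata$expression, probs = c(0, 1/3, 2/3, 1)),
  labels = c("Low", "Mid", "High"),
  include.lowest = TRUE
)
fit_tert <- coxph(Surv(time, status) ~ tertile, data = survdata)

# Score (logrank) test p-value reported by summary() for each model
pvals <- sapply(
  list(continuous = fit_cont, median_split = fit_disc, tertiles = fit_tert),
  function(fit) summary(fit)$sctest[["pvalue"]]
)
print(pvals)
```

On real data the three p-values can differ a lot, because each coding asks a slightly different question of the same gene.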

ADD COMMENT • link 6.5 years ago by Kevin Blighe 89k

0

Thanks Kevin. The expression data is log2 RSEM, and this is its distribution:

https://ibb.co/pbWBV0g

The median expression value of this gene is 8.73 across 452 samples.

ADD REPLY • link 6.5 years ago by Mike ★ 1.9k

0

What if you convert that logged data to Z-scores and then trichotomise it based on that?
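
A rough sketch of that suggestion, on simulated values. The cut-offs at Z = -1 and Z = +1 are an assumption chosen for illustration, since the thread does not say where to cut, and `expr_log2` is a hypothetical stand-in for the logged expression vector.

```r
set.seed(2)

# Simulated stand-in for the log2 expression of one gene in 452 samples
expr_log2 <- rnorm(452, mean = 8.7, sd = 1)

# Z-score: subtract the mean and divide by the standard deviation
z <- as.numeric(scale(expr_log2))

# Trichotomise on the Z-scores (cut-offs are an illustrative assumption)
group <- cut(
  z,
  breaks = c(-Inf, -1, 1, Inf),
  labels = c("Low", "Mid", "High")
)

table(group)
```

Z-scoring is a linear rescaling, so on its own it should leave the test statistics of a single-covariate continuous Cox model unchanged; only the unit of the HR changes (per standard deviation instead of per log2 unit). A median split is likewise unaffected, so the benefit here comes from the choice of cut-offs, not from the rescaling.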

ADD REPLY • link 6.5 years ago by Kevin Blighe 89k

0

I get nearly the same results using Z-score data for the discrete (logrankP = 0.02) and continuous (logrankP = 0.00004) models.

ADD REPLY • link 6.5 years ago by Mike ★ 1.9k

1

You should check the hazard ratios too, and their confidence intervals. If, in one situation, the hazard ratio is 0.6 but the upper 95% limit passes 1.0, then that is not as reliable as a situation where the upper 95% limit is 0.8. The same is true in reverse, where the hazard ratio may be 2.9: what matters is whether the lower 95% limit falls below 1.0 or stays above it.

That is: check that the hazard ratio limits don't cross the 'barrier' of 1.0. It's just a simple extra check.
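
One way to run this check in R, sketched on simulated data. `summary()` on a `coxph` fit returns a `conf.int` matrix holding the HR and its 95% limits; the `hr_table` data frame and its `crosses_one` column are names introduced here for illustration.

```r
library(survival)

set.seed(3)
n <- 200

# Simulated stand-in data (not TCGA)
expression <- rnorm(n, mean = 8.7, sd = 1)
event_time <- rexp(n, rate = 0.10 * exp(-0.15 * (expression - 8.7)))
cens_time  <- rexp(n, rate = 0.05)
survdata <- data.frame(
  time       = pmin(event_time, cens_time),
  status     = as.integer(event_time <= cens_time),  # 1 = event, 0 = censored
  expression = expression
)

fit <- coxph(Surv(time, status) ~ expression, data = survdata)

# conf.int columns: exp(coef), exp(-coef), lower .95, upper .95
ci <- summary(fit)$conf.int
hr_table <- data.frame(
  HR      = ci[, "exp(coef)"],
  HRlower = ci[, "lower .95"],
  HRupper = ci[, "upper .95"]
)

# TRUE when the 95% interval includes 1.0, i.e. it crosses the 'barrier'
hr_table$crosses_one <- hr_table$HRlower < 1 & hr_table$HRupper > 1
print(hr_table)
```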

ADD REPLY • link 6.5 years ago by Kevin Blighe 89k

0

Thanks again. Yes, there is a difference in the HRs and their confidence intervals (lower/upper 95%):

Model        HR     HRlower   HRupper
discrete     0.82   0.770     1.01
continuous   0.87   0.75      0.97

ADD REPLY • link 6.5 years ago by Mike ★ 1.9k

0

Looking at that, I'd assume that continuous was more reliable. I think that it's okay to derive the p-value and HRs from the continuous variable and then just plot dichotomised variables in the survival plot. You just have to clearly state what you have done in the methods.
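
A minimal sketch of that workflow on simulated data: the p-value comes from the continuous Cox model, while the Kaplan-Meier curves are drawn for a median split. The simulation, the `group` column, and the plot styling are all assumptions made for illustration.

```r
library(survival)

set.seed(4)
n <- 452

# Simulated stand-in data (not TCGA)
expression <- rnorm(n, mean = 8.7, sd = 1)
event_time <- rexp(n, rate = 0.10 * exp(-0.15 * (expression - 8.7)))
cens_time  <- rexp(n, rate = 0.05)
survdata <- data.frame(
  time       = pmin(event_time, cens_time),
  status     = as.integer(event_time <= cens_time),  # 1 = event, 0 = censored
  expression = expression
)

# Statistics from the continuous variable
fit_cont <- coxph(Surv(time, status) ~ expression, data = survdata)
p_cont   <- summary(fit_cont)$sctest[["pvalue"]]

# Dichotomise at the median for display only
survdata$group <- factor(
  ifelse(survdata$expression > median(survdata$expression), "High", "Low"),
  levels = c("Low", "High")
)
km <- survfit(Surv(time, status) ~ group, data = survdata)

# Kaplan-Meier curves for the two groups, annotated with the continuous p-value
plot(km, col = c("blue", "red"), xlab = "Time", ylab = "Survival probability")
legend("topright", legend = levels(survdata$group),
       col = c("blue", "red"), lty = 1)
title(main = sprintf("Cox p (continuous expression) = %.3g", p_cont))
```

Labelling the plot with where the p-value came from keeps the figure consistent with what is stated in the methods.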

ADD REPLY • link 6.5 years ago by Kevin Blighe 89k

1

Thanks Kevin for your help, I found a relevant article on this issue.

Comparing continuous and discrete analyses of breast cancer survival information

https://www.sciencedirect.com/science/article/pii/S0888754316300684

ADD REPLY • link 6.5 years ago by Mike ★ 1.9k

0

No problem.

ADD REPLY • link 6.5 years ago by Kevin Blighe 89k