I am trying to derive a binary label (0/1) for a supervised classification task from two pieces of information available for each patient, dfs.t and dfs.e, in several cancer-related studies. My main concern is how researchers fill in the dfs.e column for patients. This is my understanding:
dfs.e = 0: no relapse/recurrence/distant-metastasis within dfs.t time frame
dfs.e = 1: relapse/recurrence/distant metastasis/death-caused-by-cancer occurred at dfs.t
Is this interpretation right? Is there a conventional way of dealing with data like this?
Thanks in advance,
--Saman
I have seen the survival package and used it to plot Kaplan-Meier (KM) curves and to run tests comparing survival times between two studies.
What I am interested in is dividing patients into two meaningful, distinct groups based on their dfs.t and dfs.e values. I cannot find anything useful on this!
From dfs.t and dfs.e you can extract "did/did not recur in a given time interval", which is a useful start. Selecting the time interval shouldn't be done casually; look at clinical papers or talk to a specialist to find out a meaningful interval. Finding the features to build your classifier is up to you...
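As a rough illustration of the "did/did not recur in a given time interval" label, here is a minimal Python sketch. It assumes the interpretation of dfs.e given in the question (1 = event at dfs.t, 0 = no event observed up to dfs.t), assumes dfs.t is in years, and uses a 5-year cutoff purely as a placeholder. The function name and the toy patients are made up for the example. Patients who were censored before the cutoff get no label, because their status at the cutoff is unknown.

```python
from typing import Optional


def recurrence_label(dfs_t: float, dfs_e: int, cutoff: float) -> Optional[int]:
    """Binary label for 'event within `cutoff`' (hypothetical helper).

    Returns 1 if an event was recorded at or before `cutoff`,
    0 if the patient was followed event-free up to `cutoff` or beyond,
    and None if follow-up ended before `cutoff` without an event.
    """
    if dfs_e == 1 and dfs_t <= cutoff:
        return 1
    if dfs_t >= cutoff:
        # Either censored at/after the cutoff or the event came later:
        # in both cases the patient was event-free at the cutoff.
        return 0
    # Censored before the cutoff: status at the cutoff is unknown.
    return None


# Toy (dfs.t, dfs.e) pairs, invented for illustration only.
patients = [(2.1, 1), (7.4, 0), (3.0, 0), (6.2, 1), (5.0, 0)]
labels = [recurrence_label(t, e, cutoff=5.0) for t, e in patients]
print(labels)
```

Dropping the unlabeled patients before training is one simple option, though it shrinks the dataset and may bias it if censoring is related to outcome.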
I agree with David that looking to the clinical literature is a good idea. For some diseases, 5-year or 10-year DFS is the metric that everyone cares and talks about. Another idea is to plot the frequency of events (where dfs.e=1) against their time (dfs.t). Is there a linear accumulation of events over time, or is there a point at which the rate of accumulation changes? That could inform your choice of cutoff.
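To look at the accumulation of events described above, a small Python sketch like the following could produce the points to plot. The function name and the toy values are made up for the example, and it assumes dfs.e=1 marks an event at time dfs.t. Note that it simply counts events and ignores censored patients, so it does not account for the shrinking number of patients still under follow-up at later times.

```python
def cumulative_events(dfs_t, dfs_e):
    """Return (time, cumulative event count) pairs for rows with dfs_e == 1,
    sorted by time (hypothetical helper)."""
    event_times = sorted(t for t, e in zip(dfs_t, dfs_e) if e == 1)
    return [(t, count) for count, t in enumerate(event_times, start=1)]


# Toy values, invented for illustration only (times in years).
times = [0.5, 1.0, 1.2, 3.0, 4.0, 8.0]
events = [1, 1, 1, 0, 1, 1]

curve = cumulative_events(times, events)
for t, n in curve:
    print(t, n)
```

Plotting the count against time and looking for a bend in the curve is one informal way to spot a change in the event rate; a Kaplan-Meier curve gives a similar picture while handling censoring properly.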