I am analyzing gene X expression in the context of overall survival for a TCGA dataset. I want to take a data-driven approach that picks the cutoff giving maximum significance between the two arms (high and low expression).
Is this approach acceptable, and what kinds of bias does it introduce? I've seen numerous papers that determine cutoffs this way for Kaplan-Meier (KM) survival analysis, but I know there are other options, such as splitting at the median, comparing the extreme quartiles, or using Cox regression instead of KM (although I really don't consider the variable in my circumstance to be continuous).
Also, if I continue with the optimal cutoff, can I use permutation testing to see whether the result is real? What would my null be: randomized gene expression values while keeping the cohort sizes the same, or randomized gene expression values with a new optimal cutoff determined each time (allowing the cohort sizes to change)?
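To make the second null (re-optimizing the cutoff in every permutation) concrete, here is a rough sketch of what I have in mind. It is plain NumPy with a hand-rolled two-group log-rank statistic. The function names, the 20% minimum group fraction, and the simulated data are all placeholders of mine, not taken from any paper or package, so treat it as an illustration rather than a validated implementation.

```python
import numpy as np

def logrank_chi2(time, event, group):
    """Two-group log-rank chi-square statistic (1 degree of freedom)."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    group = np.asarray(group, dtype=bool)
    obs_minus_exp, var = 0.0, 0.0
    for t in np.unique(time[event]):          # loop over distinct event times
        at_risk = time >= t
        n, n1 = at_risk.sum(), (at_risk & group).sum()
        died = (time == t) & event
        d, d1 = died.sum(), (died & group).sum()
        obs_minus_exp += d1 - d * n1 / n      # observed minus expected in "high" arm
        if n > 1:                             # hypergeometric variance term
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return obs_minus_exp ** 2 / var if var > 0 else 0.0

def best_cutoff_stat(expr, time, event, min_frac=0.2):
    """Scan candidate cutoffs; return (largest chi-square, cutoff that gave it).

    min_frac is an assumed guard that skips cutoffs in the outer tails.
    """
    expr = np.asarray(expr, dtype=float)
    lo, hi = np.quantile(expr, [min_frac, 1 - min_frac])
    candidates = np.unique(expr[(expr >= lo) & (expr <= hi)])
    best_stat, best_cut = 0.0, None
    for c in candidates:
        stat = logrank_chi2(time, event, expr > c)
        if stat > best_stat:
            best_stat, best_cut = stat, c
    return best_stat, best_cut

def permutation_pvalue(expr, time, event, n_perm=200, seed=0):
    """Permutation p-value for the maximally selected log-rank statistic.

    Expression is shuffled across patients and the cutoff is re-optimized
    in every permutation, so the null includes the cutoff search itself.
    """
    rng = np.random.default_rng(seed)
    observed, cutoff = best_cutoff_stat(expr, time, event)
    null = np.array([
        best_cutoff_stat(rng.permutation(expr), time, event)[0]
        for _ in range(n_perm)
    ])
    p = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, cutoff, p

# Simulated data with no true association (placeholder for real TCGA values).
rng = np.random.default_rng(42)
expr = rng.normal(size=50)
time = rng.exponential(scale=24.0, size=50)   # follow-up in months
event = rng.random(50) < 0.7                  # True = death observed
stat, cutoff, p = permutation_pvalue(expr, time, event, n_perm=50)
print(f"max chi2 = {stat:.2f} at cutoff {cutoff:.2f}, permutation p = {p:.3f}")
```

With the first null (cutoff fixed, only labels shuffled), the cutoff search would not be part of the null distribution, which is why I suspect the version above is the fairer comparison.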
Thanks in advance!
I have similar questions. I downloaded TCGA data hosted on the Xena browser, and when I try to compare high and low expression, it is difficult to decide how to classify the samples.
This is not an answer to the question. I'm moving it to a comment.