I have gene expression data from around 200 cancer patients and want to build a cancer subtype classifier.
What is the correct way to handle the data? Should I divide it into training, cross-validation, and test sets, then train the classifier on the training set, tune it on the cross-validation set, and evaluate it on the test set? What is a good proportion for these sets?
Good point. I agree it's best to hold out a true test set when you can, but with only 200 samples I wonder whether that 20% would be better spent improving training than on testing. I suppose the appropriate choice depends on the data set (how homogeneous it is, how many cancer subtypes there are, etc.) and the context of the experiment (e.g., is your goal to publish?).
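To make the trade-off concrete, here is a minimal sketch of the "hold out 20%, cross-validate on the rest" option, using only the Python standard library. The labels (4 subtypes of 50 patients each), the function names `stratified_holdout` and `stratified_kfold`, and the choice of 5 folds are all assumptions for illustration; nothing here comes from the original poster's data or from any particular package.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, test_fraction=0.2, seed=0):
    """Split sample indices into (train, test), keeping a similar subtype mix in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def stratified_kfold(indices, labels, k=5, seed=0):
    """Yield (fit, validation) index pairs drawn only from `indices`."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    folds = [[] for _ in range(k)]
    for idx in by_class.values():
        rng.shuffle(idx)
        # Deal each subtype's samples round-robin so every fold sees every subtype.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    for f in range(k):
        val = sorted(folds[f])
        fit = sorted(i for g in range(k) if g != f for i in folds[g])
        yield fit, val


# Hypothetical labels: 200 patients, 4 subtypes of 50 each (an assumption).
labels = [f"subtype_{i % 4}" for i in range(200)]

train_idx, test_idx = stratified_holdout(labels, test_fraction=0.2)
folds = list(stratified_kfold(train_idx, labels, k=5))

print(len(train_idx), len(test_idx), len(folds))  # 160 40 5
```

The test indices are never passed to the cross-validation step, so tuning only ever sees the 160 training samples. If you decide the 40 held-out samples are too valuable to set aside, the same `stratified_kfold` can be run over all 200 indices instead, at the cost of having no untouched test set.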
I used PAM as implemented in the pamr package.
It seems that everyone using this package runs cross-validation on the same data used for training.