I read a paper on cancer subtype classification in Glioblastoma. After running unsupervised clustering on the gene expression data, the authors select only the samples (patients) with a positive silhouette score for supervised classification. To my mind this is not a good approach, and may even be wrong: the classifier might heavily overfit to the samples (patients) that were selected.
What is your view? Is it correct to choose training samples based on how well they behave in unsupervised clustering?
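For concreteness, here is a minimal sketch of the kind of procedure I mean, on made-up data with scikit-learn. The clustering method, number of clusters, and classifier below are placeholder choices of mine, not necessarily what the paper used:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples
from sklearn.svm import SVC

# Toy expression matrix: 200 patients x 1000 genes (random placeholder data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1000))

# 1) Unsupervised clustering defines the subtype labels
labels = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(X)

# 2) Per-sample silhouette scores measure how well each patient fits its cluster
sil = silhouette_samples(X, labels)

# 3) Keep only "core" samples with positive silhouette
core = sil > 0
X_core, y_core = X[core], labels[core]

# 4) Train the supervised classifier on the core samples only
clf = SVC(kernel="linear").fit(X_core, y_core)
```

My worry is about step 3: the samples that are hardest to assign are exactly the ones removed before training.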
Do you have a link to the paper? Without more details it is hard to comment. In general, yes, if a subset of samples with strong class distinctions were selected for training, that sounds like it could be a dubious approach, likely to perform poorly. At the end of the day, any classifier needs to be validated on a suitable independent dataset that was not used for training. Do the authors of the paper show that validation?
Thanks, see the update for the paper links.
It is quite an established paper. The training and CV data all come from a consensus clustering of 202 samples. It is impossible to "validate" directly on an independent dataset, because the "Y" response (subtype) comes from the consensus clustering itself. What they did for validation is a heatmap of the genes they selected for the classifier: the heatmap of the validation data looks similar to that of the training data (see Figure 2 of the paper).
Having scanned this paper, the approach is reasonable and not wrong. There is nothing wrong with using a silhouette statistic to select representative samples here, and Figure 2B makes the case that over-fitting did not happen. This is a class-discovery approach. I agree it's not possible to unambiguously "validate" on a test set much beyond what the authors have done, since there is no accepted external definition of the subtypes. They make a good case that in the separate validation set (Figure 2B) the distribution of subtypes is similar to that in their core sample set, and, more importantly, that the subtypes correlate with genetic and other clinically relevant factors in a validation set that was not used to define the subtypes, if I read correctly. The subtypes appear to provide a clinically or scientifically useful class assignment: Figure 4 suggests scientific utility, and Figure 5 a possible clinical utility (e.g. Proneural patients do not respond differentially to aggressive therapy).
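As an aside, one simple way to quantify the "distribution of subtypes is similar" argument, rather than eyeballing a heatmap, would be a chi-square test of homogeneity on the subtype counts in the two cohorts. A minimal sketch with made-up counts (the numbers below are placeholders, not taken from the paper):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical subtype counts (Proneural, Neural, Classical, Mesenchymal)
# in the core set and the validation set -- placeholder numbers only.
core_counts = np.array([50, 30, 40, 53])
validation_counts = np.array([45, 28, 38, 49])

# Chi-square test of homogeneity: are the subtype proportions comparable?
table = np.vstack([core_counts, validation_counts])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.3f}")
```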
I agree. What I'm not convinced by in the paper is the error rate. They got a high CV error (8.9% or more) for all samples, which improved to 4.6% for the 173 core samples. I suspect that if you applied the classifier to a new dataset, the error rate would still be 8.9% or more, because you have no way to pre-select patients in a new dataset. That is also not easy to prove.
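To illustrate the concern, here is a rough sketch of how one could compare the two error estimates on simulated data: cross-validate once on all samples and once on the silhouette-filtered "core" subset. The data, clustering method, and classifier are placeholders of my own, not the paper's pipeline; the point is only that the core-only estimate can look better simply because the ambiguous samples have been removed, which does not help with an unfiltered new cohort.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Simulated expression data with some cluster structure (placeholder only)
rng = np.random.default_rng(1)
centers = rng.normal(scale=3, size=(4, 50))
X = np.vstack([c + rng.normal(size=(50, 50)) for c in centers])

# Subtype labels come from clustering, as in a class-discovery setting
labels = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(X)
sil = silhouette_samples(X, labels)
core = sil > 0

clf = SVC(kernel="linear")
err_all = 1 - cross_val_score(clf, X, labels, cv=10).mean()
err_core = 1 - cross_val_score(clf, X[core], labels[core], cv=10).mean()
print(f"CV error, all samples:  {err_all:.3f}")
print(f"CV error, core samples: {err_core:.3f}")
```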