I am developing machine learning models to classify disease vs. non-disease patients from gene expression data. I used LASSO for feature selection and built classifiers on some of the top-ranked features. I now need to run an external validation on an independent test set to assess how well the model generalizes, and this is where I am stuck.
The training set is a GEO dataset generated on the [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array platform. However, the test set I am interested in, which has the same disease/non-disease labels as my training set, is an ArrayExpress dataset generated on the A-MEXP-2210 - Illumina HumanHT-12_V4_0_R1_15002873_B platform. As a result, some of the top features that I need to pull from the test set to validate my model are missing altogether. What would be the ideal way to validate the model in this situation?
- Keep only the genes that are present in the test set, subset the training set to those genes, and rebuild the model?
- KNN-impute the missing genes in the test set and then run the analysis?
- Assign an expression value of zero to the missing genes in the test set and then run the analysis?
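
To make the options concrete, here is a minimal sketch of what the first and third would look like mechanically. It assumes both expression matrices have already been mapped from platform-specific probe IDs to gene symbols and are held as pandas DataFrames (samples in rows, genes in columns). The gene names, the `selected` list, and the random values are all hypothetical placeholders, not my real data, and the sketch says nothing about which option is statistically sound.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows are samples, columns are gene symbols.
rng = np.random.default_rng(0)
train = pd.DataFrame(
    rng.normal(size=(6, 4)),
    columns=["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
)
test = pd.DataFrame(
    rng.normal(size=(3, 3)),
    columns=["GENE_A", "GENE_C", "GENE_E"],
)

# Hypothetical top features picked by LASSO on the training set.
selected = ["GENE_A", "GENE_B", "GENE_C"]

# Split the selected genes by whether the test platform measures them.
shared = [g for g in selected if g in test.columns]
missing = [g for g in selected if g not in test.columns]

# Option 1: restrict both sets to the shared genes.
# The classifier would then be refit on train_shared before scoring test_shared.
train_shared = train[shared]
test_shared = test[shared]

# Option 3: keep every selected gene and fill the absent ones with zero.
test_zero_filled = test.reindex(columns=selected, fill_value=0.0)

print("shared:", shared)
print("missing:", missing)
```

In this toy case `GENE_B` is the feature that is selected in training but absent from the test platform, which mirrors my situation.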