I have developed a classification model on microarray data from a GEO dataset. Let's call this the train set. The class labels are "extreme" vs "not-so-extreme" disease course. My advisor has now asked me to check how well the model generalizes, but there is no other dataset with the same labels. However, it is known that an "extreme" disease course often leads to death, while a "not-so-extreme" course usually does not. So I looked for datasets with mortality labels [Survivor and Deceased] and found one. Let's call this the test set. I mapped the "extreme" label of the train set to "Deceased" in the test set, and "not-so-extreme" to "Survivor".
Here's where the problem starts. I selected the best set of features and tuned the parameters on the train set alone, and when I validate on the test set I get an AUC of around 0.50, which is no better than chance. I don't know whether this is because of the way I have matched the labels or because the two datasets were collected on different microarray platforms. The train set is based on the "[HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array" platform and the test set is based on the "Illumina HumanHT-12_V4_0_R1_15002873_B" platform. Can somebody help?
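One check that might help separate the two explanations is to standardize each gene within each dataset before scoring, so that platform-specific scales do not distort a model trained elsewhere. The sketch below uses only toy numbers, not the real data. The gene names, the weights, and the helper functions are all hypothetical, and it assumes probes on both platforms have already been collapsed to shared gene symbols. It only illustrates the platform-scale issue; it says nothing about whether the label mapping is valid.

```python
from statistics import mean, pstdev

def zscore(values):
    """Standardize one gene's values within a single dataset."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd if sd > 0 else 0.0 for v in values]

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    Ties count as half a correct pair."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def linear_score(expr, weights):
    """Weighted sum of gene values per sample (a stand-in for any model)."""
    genes = list(weights)
    n_samples = len(expr[genes[0]])
    return [sum(weights[g] * expr[g][i] for g in genes)
            for i in range(n_samples)]

# Hypothetical weights, standing in for a model fitted on the train platform.
weights = {"GENE_A": 1.0, "GENE_B": 1.0}

# Toy test set: 1 = Deceased, 0 = Survivor.
labels = [1, 1, 1, 0, 0, 0]
test_expr = {
    # Informative gene, small numeric range on the test platform.
    "GENE_A": [9.0, 8.0, 7.0, 3.0, 2.0, 1.0],
    # Uninformative gene, but reported on a much larger scale.
    "GENE_B": [400.0, -300.0, 100.0, 200.0, -200.0, 300.0],
}

# Scoring raw values: the large-scale noise gene dominates the sum.
raw_auc = auc(linear_score(test_expr, weights), labels)

# Scoring after per-gene standardization within the test set.
std_expr = {g: zscore(v) for g, v in test_expr.items()}
std_auc = auc(linear_score(std_expr, weights), labels)

print(f"raw AUC: {raw_auc:.3f}, standardized AUC: {std_auc:.3f}")
```

In this toy case the AUC on raw values is near chance and improves after standardization, because the two genes end up on comparable scales. If the real test AUC stays near 0.50 after a comparable within-dataset normalization, the label mapping becomes the more likely suspect. Note that per-dataset standardization implicitly assumes the two datasets have broadly similar class proportions.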