Hi all, I am not an expert in machine learning (ML) and have a few specific questions about designing a binary classifier. I have bulk RNA-seq data for samples from 6 different cancer types. Each sample belongs to either class A or class B: for each cancer type I have 10 samples from class A and 10 from class B, so 120 samples in total (20 per cancer type, evenly split between the two classes).
I would like to create a classifier that assigns samples to class A or B. I could split the 120 samples randomly into training and test sets and follow a standard ML workflow in scikit-learn, trying different models (logistic regression, SVM, and so on). One issue with that is how to do feature selection. I could run differential expression (DE) analysis with DESeq2 to get the DE genes between classes A and B for each cancer type, then use the genes that are DE in all 6 cancer types as input features for the binary classifier. But that would cause leakage between the training and test sets, since features should be derived from the training set only: if I split the 120 samples randomly, the test set would contain samples that were already used to define the input features (the common DE genes across the 6 cancer types).
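One common way to avoid exactly this kind of leakage is to make feature selection part of the model pipeline, so it is refit on the training fold of every cross-validation split and the test fold never influences which genes are chosen. A minimal sketch, assuming a univariate filter (scikit-learn's `SelectKBest` with the ANOVA F-test) as a stand-in for DESeq2-based selection, and random toy data in place of the real 120-sample expression matrix:

```python
# Sketch: feature selection inside the CV loop so test folds never influence
# which genes are selected. SelectKBest (ANOVA F-test) stands in for a
# DESeq2-style DE filter; X and y below are toy placeholders.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5000))   # 120 samples x 5000 genes (toy data)
y = np.repeat([0, 1], 60)          # class A / class B labels

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=100)),   # refit on training folds only
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y,
                         cv=StratifiedKFold(5, shuffle=True, random_state=0))
print(scores.mean())   # leakage-free estimate of accuracy
```

Because the selector lives inside the `Pipeline`, `cross_val_score` fits it only on each training fold, which is the leakage-free version of "features come from the training set".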
Alternatively, I could use the samples from 4 of the cancer types for training and the samples from the remaining 2 cancer types for testing. Then I could use the common DE genes across those 4 cancer types as input features for training, and evaluate the trained model on the test set. But how can I make that unbiased? Which 4 cancer types should I use for training? Is there a better way to design this classifier, or better ways to select the features?
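If the goal is a by-cancer-type split without having to pick a favourite subset, one option is to rotate over all cancer types with grouped cross-validation, so every type is held out exactly once. A hedged sketch using scikit-learn's `LeaveOneGroupOut`, where `groups` encodes each sample's cancer type and X/y are again toy placeholders:

```python
# Sketch: instead of one fixed 4-vs-2 cancer-type split, hold out each cancer
# type in turn with LeaveOneGroupOut. Feature selection stays inside the
# pipeline, so it is recomputed from the training cancer types each round.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 5000))           # toy expression matrix
y = np.tile(np.repeat([0, 1], 10), 6)      # 10 class-A + 10 class-B per type
groups = np.repeat(np.arange(6), 20)       # cancer-type label per sample

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=100)),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, groups=groups, cv=LeaveOneGroupOut())
print(scores)   # one score per held-out cancer type
```

Averaging the six scores gives an unbiased picture of how the classifier generalises to an unseen cancer type, without arbitrarily choosing which types go into training.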
So sorry for the long description. Thanks in advance for any suggestions or comments. I would really appreciate any help.
Did you read the scikit-learn page on feature selection? A first choice might be removing unexpressed genes (zero or very low variance; be careful about how you normalize/scale the data), then using an L1 penalty with your linear model, or doing recursive feature elimination with cross-validation. I've also seen many papers that use PCA and retain the principal components as features for the model: that's another possible way, as explained here as well.
Regarding cancer types, just make sure the dataset is balanced between the two outcome classes (A, B). Do not split by cancer type: your model should see as much data as possible to learn something.
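The variance-filter plus L1-penalty suggestion above can be sketched as a single pipeline. This is only illustrative: the variance threshold, the regularisation strength `C`, and the toy data are assumptions, not tuned values, and on real counts the filtering should happen after a sensible normalisation.

```python
# Sketch of the suggestions above: drop constant / near-zero-variance genes,
# then let an L1-penalised logistic regression do embedded feature selection.
# Threshold, C, and the toy X/y are illustrative assumptions.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 5000))
X[:, :500] = 0.0                      # simulate unexpressed genes
y = np.repeat([0, 1], 60)

pipe = Pipeline([
    ("var", VarianceThreshold(threshold=1e-8)),  # removes the constant genes
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(penalty="l1", solver="liblinear", C=0.1)),
])
pipe.fit(X, y)
n_kept = int(pipe.named_steps["var"].get_support().sum())
n_used = int((pipe.named_steps["clf"].coef_ != 0).sum())
print(n_kept, n_used)   # genes passing the filter / genes with non-zero weight
```

The L1 penalty drives most coefficients to exactly zero, so the model effectively performs its own feature selection on top of the variance filter, and both steps are refit inside any outer cross-validation.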
Thanks so much for your suggestions! I will go over the links you sent and try what you mentioned. I agree that not splitting by cancer type is better; that was my original plan as well.