I am working on a patch-seq dataset in which ~100 cells have been recorded in two conditions (say P for physiology and D for disease; the dataset is balanced).
To determine whether the two conditions can be separated based on the transcripts,
I repeatedly train machine learning models (Keras, logistic regression, SVC, ...) using repeated K-fold cross-validation.
Each fold yields a very small set (10-15 cells) with y_true and y_pred (true condition and predicted condition).
By accumulating these small sets (the classifier is re-initialized with a fresh random state at every fold), I can evaluate the "average" accuracy of the models (and optionally extract the important features with the logistic regression).
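Roughly, the evaluation loop looks like this (a sketch with synthetic data standing in for the real expression matrix; the fold counts and the model are just placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the real (cells x genes) expression matrix and condition labels
X, y = make_classification(n_samples=100, n_features=500, n_informative=30, random_state=0)

rskf = RepeatedStratifiedKFold(n_splits=8, n_repeats=10, random_state=0)  # ~12-cell test sets

y_true_all, y_pred_all = [], []
for train_idx, test_idx in rskf.split(X, y):
    # fresh classifier at every fold
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    clf.fit(X[train_idx], y[train_idx])
    y_true_all.extend(y[test_idx])
    y_pred_all.extend(clf.predict(X[test_idx]))

print("pooled accuracy:", accuracy_score(y_true_all, y_pred_all))
```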
Results:
All models give me ~0.75 accuracy on average across trials (importantly, a dummy classifier does not separate the two conditions).
I then select the ~100 most informative genes from the logistic regression (which appear to be physiologically relevant, which is a good sign).
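For example, the ranking step looks something like this (a sketch; I rank by the absolute coefficient magnitude of a logistic regression fitted on the full dataset, and the gene names are placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the expression matrix; gene_names is a placeholder
X, y = make_classification(n_samples=100, n_features=500, n_informative=30, random_state=0)
gene_names = np.array([f"gene_{i}" for i in range(X.shape[1])])

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)
ranking = np.argsort(np.abs(pipe[-1].coef_.ravel()))[::-1]   # genes sorted by |coefficient|
top_genes = gene_names[ranking[:100]]                        # the ~100 "most informative" genes
```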
Questions:
Is it "legal" to re-train my models using only these genes?
When doing so, is it normal that the accuracy gets astonishingly high (for all models)?
Is it "legal" to re-train my models using only these genes?
If you take the genes you selected using results aggregated over the whole dataset and use them to re-train a final classifier, what you are doing is out-of-loop feature selection. It will necessarily give inflated accuracy scores, since the selection already used information from the entire dataset, including the cells you later test on. Because of this:
When doing so, is it normal that the accuracy gets astonishingly high (for all models)?
Yes.
What you can do instead is one of two things. Option 1: perform the feature selection within the cross-validation loop, so that you estimate the accuracy of the whole procedure, and then apply that procedure one last time to the full dataset.
Option 2: hold out ~20 cells up front, do whatever you want with the remaining ~80 cells to build a classifier, and then "unblind" the original held-out 20.
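To see why the out-of-loop version inflates the score, here is a sketch on pure noise, where an honest estimate should sit near 0.5 (SelectKBest with an F-test stands in for whatever gene-ranking method you actually use):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# pure-noise stand-in: there is NO real signal in X
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))
y = np.repeat([0, 1], 50)

cv = RepeatedStratifiedKFold(n_splits=8, n_repeats=5, random_state=0)

# out-of-loop selection (what the question describes): the 100 "best" genes are picked using ALL cells
X_reduced = SelectKBest(f_classif, k=100).fit_transform(X, y)
biased = cross_val_score(LogisticRegression(max_iter=1000), X_reduced, y, cv=cv)

# in-loop selection (option 1): the genes are re-selected inside every training fold
pipe = make_pipeline(SelectKBest(f_classif, k=100), LogisticRegression(max_iter=1000))
honest = cross_val_score(pipe, X, y, cv=cv)

print(f"out-of-loop selection: {biased.mean():.2f}")  # well above chance, despite pure noise
print(f"in-loop selection:     {honest.mean():.2f}")  # near 0.5, as it should be
```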
There is nothing wrong with eliminating uninformative features; Google "feature elimination" for more on the subject. However, it should not be done to maximize the training score. It should be done to obtain the cross-validation (CV) score that generalizes best to unseen data.
Briefly, this means using K folds (K is usually 5 or 10; I will use 10 in my example) for validation, so that you train 10 models, each on 90% of the data, and always validate on a different 10%. During training, the model is fit only on the training data, while the validation fold serves as the control for early stopping. You then average the accuracy scores of those 10 models to get a cross-validated score that estimates how such a model would perform on unseen data.
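As a sketch of that loop with Keras (since it is among the models mentioned in the question), using synthetic stand-in data and arbitrary layer sizes, batch size, and patience:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from tensorflow import keras

# synthetic stand-in for the expression matrix and condition labels
X, y = make_classification(n_samples=100, n_features=500, n_informative=30, random_state=0)
X = X.astype("float32")

def build_model(n_features):
    model = keras.Sequential([
        keras.layers.Input(shape=(n_features,)),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in cv.split(X, y):
    model = build_model(X.shape[1])                      # fresh model for every fold
    stop = keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
    model.fit(X[train_idx], y[train_idx],
              validation_data=(X[val_idx], y[val_idx]),  # the 10% used as the early-stopping control
              epochs=100, batch_size=16, verbose=0, callbacks=[stop])
    _, acc = model.evaluate(X[val_idx], y[val_idx], verbose=0)
    scores.append(acc)

print("cross-validated accuracy:", np.mean(scores))
```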
There are many sophisticated ways of removing features, and some machine learning methods can even do it automatically. In a simple implementation, you remove each feature one at a time and calculate the CV score for each of the reduced datasets. The highest CV score tells you which feature is least important, so you discard it. You then iterate over the remaining features, and keep going until removing any feature results in a lower score. It is important to always use the same folds for the CV, otherwise the results are not comparable.
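Here is a minimal sketch of that greedy elimination loop, on a small synthetic stand-in so it runs quickly; the same cv object (with a fixed random state) is reused so every candidate is scored on identical folds:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# small synthetic stand-in so the greedy search stays fast
X, y = make_classification(n_samples=100, n_features=20, n_informative=5, random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)   # same folds throughout
clf = LogisticRegression(max_iter=1000)

kept = list(range(X.shape[1]))
best_score = cross_val_score(clf, X[:, kept], y, cv=cv).mean()

while len(kept) > 1:
    # try dropping each remaining feature in turn and score the reduced dataset
    trial_scores = {f: cross_val_score(clf, X[:, [k for k in kept if k != f]], y, cv=cv).mean()
                    for f in kept}
    worst_feature, score = max(trial_scores.items(), key=lambda kv: kv[1])
    if score < best_score:        # removing anything now hurts, so stop
        break
    kept.remove(worst_feature)    # discard the least important feature
    best_score = score

print("kept features:", kept, "CV accuracy:", round(best_score, 3))
```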
Thanks. If I understand correctly, you propose (option 2) to:
- split my dataset into a train_test_set (~80%) and a validate_set (~20%)
- do whatever I want with the train_test_set to build a classifier
- unblind the validate_set at the end
(I didn't get option 1.)
Yes. And if you took bullet 1 (the ~80%/~20% split) and did that 5 times (resulting in a nested cross-validation), you would have option 1. The resulting accuracy would be your estimate, and then you could drop the "outer" cross-validation and run the same procedure on the full dataset to create the final classifier.
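A sketch of that nested setup, again with synthetic stand-in data; here the inner cross-validation tunes how many genes to keep (the grid of values is arbitrary), while the outer 5-fold loop provides the accuracy estimate:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# synthetic stand-in for the expression matrix and condition labels
X, y = make_classification(n_samples=100, n_features=500, n_informative=30, random_state=0)

pipe = Pipeline([("select", SelectKBest(f_classif)),
                 ("clf", LogisticRegression(max_iter=1000))])

# inner CV: run on each ~80% "train_test_set" to pick how many genes to keep
inner = GridSearchCV(pipe, {"select__k": [25, 50, 100, 200]},
                     cv=StratifiedKFold(5, shuffle=True, random_state=1))

# outer CV: five different ~20% "validate_set"s that the inner search never sees
outer = StratifiedKFold(5, shuffle=True, random_state=0)
scores = cross_val_score(inner, X, y, cv=outer)
print("nested-CV accuracy estimate:", scores.mean().round(3))

# final classifier: drop the outer loop and run the same procedure once on all cells
final_model = inner.fit(X, y).best_estimator_
```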