Logistic regression to separate 2 conditions
2 days ago
yves33 • 0

Hi all,

I am working on a patch-seq dataset in which ~100 cells have been recorded in two conditions (say P for physiology and D for disease; the dataset is balanced).

In order to determine whether the two conditions can be separated based on the transcripts:

  • I repeatedly train machine learning models (Keras, Logistic Regression, SVC, ...) using repeated KFold validation.
  • Each fold yields a very small test set (10-15 cells) with y_true and y_pred (true and predicted condition).
  • By accumulating these small sets (the classifier is reset to a fresh random state at each fold), I can evaluate the "average" accuracy of the models (and optionally extract the important features from the logistic regression); see the sketch after this list.
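
A minimal sketch of this loop (scikit-learn only, for brevity; X is the cells x genes matrix, y the condition labels, and the fold counts are illustrative):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.metrics import accuracy_score

    rkf = RepeatedStratifiedKFold(n_splits=8, n_repeats=20, random_state=0)
    y_true_all, y_pred_all, coef_per_fold = [], [], []

    for train_idx, test_idx in rkf.split(X, y):
        clf = LogisticRegression(max_iter=5000)    # fresh classifier every fold
        clf.fit(X[train_idx], y[train_idx])
        y_true_all.extend(y[test_idx])              # accumulate the small test sets
        y_pred_all.extend(clf.predict(X[test_idx]))
        coef_per_fold.append(clf.coef_[0])          # keep coefficients for gene ranking

    print("accumulated accuracy:", accuracy_score(y_true_all, y_pred_all))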

Results

All models give me ~0.75 accuracy on average across trials (importantly, a dummy classifier does not separate the two conditions). I then select the ~100 most informative genes from the logistic regression (which appear to be physiologically relevant, which is a good sign).
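
The gene ranking step is roughly the following (a sketch: it assumes the per-fold coefficients collected in the loop above, that coefficient magnitudes are comparable because the expression matrix was standardised, and that gene_names holds the transcript identifiers):

    import numpy as np

    # coef_per_fold: list of clf.coef_[0] arrays collected in the CV loop above
    mean_abs_coef = np.abs(np.vstack(coef_per_fold)).mean(axis=0)
    top_idx = np.argsort(mean_abs_coef)[::-1][:100]   # 100 largest average |coefficients|
    top_genes = [gene_names[i] for i in top_idx]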

Questions:

  • Is it "legal" to re-train my models using only these genes?
  • When doing so, is it normal that the accuracy gets astonishingly high (for all models)?

Thanks in advance

Marker-genes Logistic-regression • 285 views
2 days ago
LChart 4.6k

Is it "legal" to re-train my models using only these genes?

If you take the training results you aggregated over the whole dataset and use these to re-train a final classifier, what you are doing is out-of-loop feature selection. Doing so will necessarily give inflated accuracy scores, since you have already used information from the entire dataset. Because of this:

When doing so, is it normal that the accuracy gets astonishingly high (for all models)

Yes.

What you can do is either perform the feature selection within the loop, so that the cross-validation estimates the accuracy of the whole procedure (selection plus classifier), and then apply that procedure one last time on the full dataset.
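
A minimal sketch of the in-loop version (the "top 100 genes by |coefficient|" rule is just an illustrative stand-in for whatever selection you use):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import accuracy_score

    scores = []
    for tr, te in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
        # rank genes on the training fold ONLY
        ranker = LogisticRegression(max_iter=5000).fit(X[tr], y[tr])
        top = np.argsort(np.abs(ranker.coef_[0]))[::-1][:100]
        # refit on the selected genes and score on the untouched test fold
        clf = LogisticRegression(max_iter=5000).fit(X[tr][:, top], y[tr])
        scores.append(accuracy_score(y[te], clf.predict(X[te][:, top])))

    print("estimate of the whole procedure:", np.mean(scores))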

Or: You can hold out 20 cells up-front, do whatever you want with the remaining 80 cells to build a classifier, and then "unblind" the original held-out 20.
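
For the hold-out route, the only important point is that the split happens before anything else (a sketch; build_classifier is a hypothetical placeholder for the whole selection + training procedure):

    from sklearn.model_selection import train_test_split

    X_dev, X_holdout, y_dev, y_holdout = train_test_split(
        X, y, test_size=20, stratify=y, random_state=0)

    # ... all feature selection / model building uses X_dev, y_dev only ...
    # final_model = build_classifier(X_dev, y_dev)      # hypothetical helper
    # print("unblinded accuracy:", final_model.score(X_holdout, y_holdout))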


Thanks,

If I understand correctly, you propose to (option 2, roughly sketched in code below):

  • split my dataset into train_test_set (~80%) and validate_set (~20%)
  • run KFold validation on train_test_set to isolate relevant genes (logistic regression): fold 1 -> remember relevant genes_1, fold 2 -> remember relevant genes_2, (...)
  • build a list of relevant genes, keeping only those that appear as relevant in 75% of [genes_1, genes_2, (...)]
  • train classifier(s) using these genes and train_test_set as training data
  • test these newly trained classifiers on validate_set

(I didn't get option 1.)
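
In code, I imagine something like this (a rough sketch; "top 100 genes per fold" and the 75% threshold are the choices from the bullets above):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split, StratifiedKFold
    from sklearn.metrics import accuracy_score

    X_tt, X_val, y_tt, y_val = train_test_split(X, y, test_size=0.2,
                                                stratify=y, random_state=0)

    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    hits = np.zeros(X.shape[1])
    for tr, _ in kf.split(X_tt, y_tt):
        coef = LogisticRegression(max_iter=5000).fit(X_tt[tr], y_tt[tr]).coef_[0]
        hits[np.argsort(np.abs(coef))[::-1][:100]] += 1     # relevant genes of this fold

    stable = np.where(hits >= 0.75 * kf.get_n_splits())[0]  # relevant in >=75% of folds

    clf = LogisticRegression(max_iter=5000).fit(X_tt[:, stable], y_tt)
    print("accuracy on validate_set:", accuracy_score(y_val, clf.predict(X_val[:, stable])))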


Yes. If you took bullet 1:

split my dataset into train_test_set (~80%) and validate_set (~20%)

and did that 5 times (resulting in a nested cross-validation), you would have option 1. The resulting accuracy would be your estimate, and then you could remove the "outer" cross-validation to create the final classifier.
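
One compact way to implement that in scikit-learn is to make the gene selection part of a Pipeline, so it is re-fit inside every outer fold (a sketch; the SelectKBest step is only an illustrative stand-in for your selection rule):

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    pipe = make_pipeline(StandardScaler(),
                         SelectKBest(f_classif, k=100),    # selection redone per fold
                         LogisticRegression(max_iter=5000))

    estimate = cross_val_score(pipe, X, y, cv=5).mean()    # your accuracy estimate
    final_model = pipe.fit(X, y)                           # outer CV removed: final classifier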

11 hours ago
Mensur Dlakic ★ 28k

There is nothing wrong with eliminating uninformative features - Google "feature elimination" for more info on that subject. However, this should not be done to get the highest training score; it should be done to get the cross-validation (CV) score that generalizes best to unseen data.

Briefly, this means using K folds (K is usually 5 or 10; I will use 10 in my example) for validation, in such a way that you train 10 models on 90% of the data and always validate on a different 10% of the data. During training, each model is built only from the training data, while the validation data serve as a control for early stopping. Then you average the accuracy scores of those 10 models and get a cross-validated score, which estimates how an average of those 10 models would perform on unseen data.
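
As a sketch, using Keras (one of the models mentioned in the question) so that the held-out fold can also drive early stopping; the architecture and fold count are only illustrative:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from tensorflow import keras

    def make_model(n_features):
        model = keras.Sequential([keras.layers.Input(shape=(n_features,)),
                                  keras.layers.Dense(1, activation="sigmoid")])
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
        return model

    scores = []
    for tr, va in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
        model = make_model(X.shape[1])
        stop = keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
        model.fit(X[tr], y[tr], validation_data=(X[va], y[va]),
                  epochs=200, verbose=0, callbacks=[stop])      # val fold controls early stopping
        scores.append(model.evaluate(X[va], y[va], verbose=0)[1])  # accuracy on the held-out 10%
    print("CV accuracy:", np.mean(scores))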

There are many sophisticated ways of removing features, and some machine learning methods even know how to do it automatically. In a simple implementation, you remove each feature one at a time and calculate CV scores for each of those reduced-feature datasets. The highest CV score tells you which feature is least important, so you discard it. Then you iterate over the remaining features, and keep doing so until removing any feature results in a lower score. It is important to always use the same folds for CV, or else the results will not be comparable.
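
A bare-bones version of that loop (a sketch; with thousands of genes you would pre-filter first, since this re-runs CV for every candidate removal):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    # fix the folds once so every CV score is computed on the same splits
    folds = list(StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y))

    def cv_score(cols):
        return cross_val_score(LogisticRegression(max_iter=5000),
                               X[:, cols], y, cv=folds).mean()

    kept = list(range(X.shape[1]))
    best = cv_score(kept)
    while len(kept) > 1:
        # try removing each remaining feature and keep the best-scoring removal
        score, drop = max((cv_score([f for f in kept if f != d]), d) for d in kept)
        if score < best:        # removing any feature now lowers the CV score: stop
            break
        best, kept = score, [f for f in kept if f != drop]

    print(len(kept), "features kept; CV accuracy", best)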

