I have an RNA-seq dataset of 160 samples (80 from patients with a certain disease and 80 from healthy individuals). I want to use a small set of genes to predict disease status, and therefore would like to use lasso regression. I am also aware that my dataset is relatively small, so I would like to perform cross-validation to test the model built from the genes selected through lasso regression. However, I am not sure how to do this. Would the following be a good way to use lasso regression and k-fold CV to predict disease status?
create a training and validation split (70:30)
perform lasso regression to select the best gene combination using the training set only
remove all genes (gene counts) that are not part of the best gene combination from the test set
use the remaining selected genes as variables in my test set to estimate the predictive value of my gene combination with k-fold cross-validation
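For concreteness, here is a minimal Python sketch of the procedure I have in mind, assuming `X` is the gene-count matrix (160 samples × genes) and `y` is the 0/1 disease status; both names are placeholders, not from my actual data:

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LassoCV, LogisticRegression

# X: gene-count matrix (160 samples x genes), y: 0/1 disease status -- placeholder names
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# select the best gene combination with lasso on the training set only
lasso = LassoCV(cv=10, random_state=0).fit(X_train, y_train)
selected = np.flatnonzero(lasso.coef_)  # genes with non-zero coefficients

# keep only the selected genes and estimate their predictive value
# on the held-out split with k-fold cross-validation
clf = LogisticRegression(max_iter=1000)
auc = cross_val_score(clf, X_test[:, selected], y_test, cv=5, scoring="roc_auc")
print(f"{selected.size} genes selected; mean CV AUC: {auc.mean():.3f}")
```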
Hello, is this a regression case or a classification one? It may be very hard to get good results with only 160 rows. Feel free to also try the random forest regressor. Can you get more data?
On a small dataset such as yours, I don't think that a simple validation split will do. A preferred way is to do a lasso with cross-validation (CV), which will test many alpha parameters (I believe in the R implementation this is called a lambda parameter) and find the one that is optimal. A Python implementation of that procedure is available in scikit-learn as LassoCV.
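A minimal sketch, assuming `X` is the gene-count matrix and `y` the 0/1 disease labels (placeholder names):

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# lasso is sensitive to feature scale, so standardize the gene counts first
X_scaled = StandardScaler().fit_transform(X)

# try a grid of alphas and keep the one with the lowest cross-validated error
model = LassoCV(n_alphas=100, cv=10).fit(X_scaled, y)
print("optimal alpha:", model.alpha_)

# genes whose coefficients were shrunk to exactly zero can be dropped
kept = np.flatnonzero(model.coef_)
print(f"{kept.size} of {model.coef_.size} genes retained")
```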
This will apply different regularization strengths: larger alpha/lambda means more regularization, and smaller alpha/lambda means less. A larger alpha will shrink the coefficients of more features (genes, in your case) to zero, thus eliminating a larger number of genes; that could lead to underfitting. A smaller alpha will shrink fewer coefficients, which means more genes will be retained; that could lead to overfitting if meaningless genes are included. In short, the CV procedure finds the optimal alpha that balances the two. I suggest at least 10-fold CV, and even 20-fold might be needed. Once the best alpha/lambda is found, you can print the feature coefficients; genes whose coefficients are exactly zero can be excluded from modeling.
But in lasso regression alpha is 1 by definition, right (at least in R, where glmnet's alpha is the elastic-net mixing parameter)? From what I understand, an optimal lambda can be found with CV, either the one that minimizes prediction error (lambda.min) or the one that uses the fewest variables while staying within one standard error of that minimum (lambda.1se). However, with your method I'm not using any validation (CV or bootstrapping) to verify that the AUC I find is stable (i.e., does not change if I repeat the whole procedure), right?
Like I told you, I think that what is called alpha in Python's LassoCV is equivalent to lambda in glmnet (or whatever other R implementation); glmnet's alpha is a different parameter, the elastic-net mixing weight, which is fixed at 1 for the lasso.
The rest of your comment is not clear to me. In lasso regression you are minimizing an error against your two outcomes (healthy/disease), which can be coded as 0 and 1. Minimizing that error requires ranking the samples correctly, which is essentially the same as maximizing AUC in a classification procedure. Once you find the alpha (Python) / lambda (R) parameter by CV, there is no need for additional CV or bootstrapping, because its stability has already been tested during LassoCV.
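To make the ranking point concrete: AUC depends only on how the predictions order the samples, so the continuous lasso outputs can be scored with it directly. A small demonstration, reusing `model` and `X_scaled` from the sketch above:

```python
from sklearn.metrics import roc_auc_score

# continuous lasso predictions; only their ordering matters for AUC
scores = model.predict(X_scaled)
print("AUC:", roc_auc_score(y, scores))

# any strictly increasing transform of the scores leaves the AUC unchanged
print("AUC after rescaling:", roc_auc_score(y, 2 * scores + 5))
```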
In R, glmnet does provide a built-in CV procedure (cv.glmnet); if your implementation lacks one, you may have to build it on top of the lasso yourself.
It should be under normal circumstances, but I'm not sure that will hold for such a small dataset.
Instead of lasso, you can do logistic regression with cross-validation using an L1 penalty, which is equivalent to the lasso. That way you can also eliminate some genes while directly maximizing AUC. Here you need to optimize a parameter called C, which is the inverse of the regularization strength (smaller C means stronger regularization).
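A sketch of that, again with placeholder names, picking C by 10-fold CV while scoring on AUC:

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

# L1-penalized logistic regression; C is the inverse regularization strength,
# so smaller C zeroes out more gene coefficients
clf = LogisticRegressionCV(
    Cs=20,                # grid of 20 candidate C values
    cv=10,
    penalty="l1",
    solver="liblinear",   # a solver that supports the L1 penalty
    scoring="roc_auc",
    max_iter=1000,
).fit(X_scaled, y)

print("best C:", clf.C_[0])
kept = np.flatnonzero(clf.coef_)
print(f"{kept.size} genes kept with non-zero coefficients")
```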
This is another machine learning algorithm that Mensur recommended. I have used logistic regression in some machine learning projects, and it worked very well for me. I agree with him. Feel free to try it!