Hi,
I'm trying to perform lasso regression with cross-validation on an RNA-seq dataset to find a combination of RNAs that most accurately predicts disease status. However, I am not sure whether what I'm doing is the best approach and would like some advice on how to go forward. I also have a question, at the bottom of my text, about how to view the coefficients of my lasso regression model.
Some background information: my dataset has 192 samples in two classes (healthy and disease). Given this sample size, I think cross-validation is more appropriate for evaluating my model than a single train-test split. I would also like to use a fairly small number of variables/genes to predict disease status, which is why I'm using lasso regression.
To create the model I have used the caret package:
myControl <- trainControl(
  method = "cv",
  number = 10,
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = TRUE
)
To fit and evaluate the model I have used the caret and glmnet packages:
model <- train(
  disease_status ~ .,
  data = my_dataset,
  method = "glmnet",
  metric = "ROC",  # optimize ROC, since twoClassSummary is used
  trControl = myControl
)
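As an aside, since I specifically want lasso, I think I could also fix alpha at 1 with a custom tuneGrid so that caret only tunes lambda. A sketch of what I have in mind (the lambda range is just a guess):

# Hypothetical grid: fix alpha = 1 (pure lasso) and tune lambda only
lasso_grid <- expand.grid(
  alpha  = 1,
  lambda = 10^seq(-4, 0, length.out = 50)
)
# passed to train() via tuneGrid = lasso_grid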
By printing the output of model, I can see that the best tune (alpha = 1.0, lambda = 0.01334) gives an ROC of 0.82. However, I don't know how to print the predictors that my lasso regression model uses to achieve this result; coef(model) just returns NULL. Can anyone help me with these two questions: are lasso regression and CV the best way to solve my problem, and how do I print the predictors selected by lasso?
Thanks for answering! One question about CV: do I have to make a train (70% of all samples) / test (30% of all samples) split, perform CV and train the model on the training set, and evaluate performance on the test set? Or do I have to do everything on the original dataset and calculate the model's predictive accuracy on the original dataset, including all the samples?
Yes, the idea is to perform CV on the training set and evaluate on the test set. For this to be reliable, some kind of stratified split is needed so that both datasets have the same proportion of the classification categories, and ideally the same distributions.
You don't have to use all of the data for training; generally that is only done when the dataset is very small and one can't afford to set aside any data for validation. In such cases leave-one-out CV is often performed instead of N-fold CV. Generally speaking, with smaller datasets a higher fold number (10+) tends to help with generalization, while with larger datasets 3- or 5-fold CV is usually sufficient.
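For example, a minimal sketch of a stratified split using caret's createDataPartition, which samples within each outcome class (the object names my_dataset and disease_status are taken from your example and may need adjusting):

set.seed(42)  # reproducible split
# createDataPartition samples within each class,
# so class proportions are preserved in both subsets
train_idx <- createDataPartition(my_dataset$disease_status, p = 0.8, list = FALSE)
train_set <- my_dataset[train_idx, ]
test_set  <- my_dataset[-train_idx, ]
# For very small datasets, leave-one-out CV can be requested with
# trainControl(method = "LOOCV") instead of method = "cv"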
There is no hard rule, but your intended split (70:30) is unusual; I think most people go for 75:25 or 80:20 splits.

I noticed that glmnet may be calling its lasso regularization factor lambda, while in the Python implementation that factor is called alpha. If so, the regularization value you got (0.01334) is probably fine.

It seems that your command gave NULL because you didn't specify for which lambda the coefficients should be shown. You may find this page helpful for how to print the lasso coefficients.
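For example, a sketch assuming model is the caret object from your train() call above (model$finalModel is the underlying glmnet fit):

# Coefficients at the cross-validated best lambda;
# predictors with a non-zero coefficient are the ones lasso kept
best_lambda <- model$bestTune$lambda
coef_mat <- as.matrix(coef(model$finalModel, s = best_lambda))
coef_mat[coef_mat[, 1] != 0, , drop = FALSE]  # only the selected genes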
Finally, if an answer was helpful and/or solved your problem, it is customary to upvote and/or accept it.