Hi,
I'm trying to perform lasso regression with cross-validation on an RNA-seq dataset to find a combination of RNAs that most accurately predicts disease status. However, I am not sure whether what I'm doing is the best approach and would like some advice on how to go forward. I also have a question, at the bottom of my text, about how to view the coefficients of my lasso regression model.
Some background information: my dataset has 192 samples in two classes (healthy and disease). Given this sample size, I think cross-validation is more appropriate for evaluating my model than a single train-test split. I would also like to use a fairly small number of variables/genes to predict disease status, which is why I'm using lasso regression.
To create the model I have used the caret package:
myControl <- trainControl(
  method = "cv",
  number = 10,
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = TRUE
)
To fit and evaluate the model I have used the caret and glmnet packages:
model <- train(
  disease_status ~ .,
  data = my_dataset,
  method = "glmnet",
  metric = "ROC",  # optimize ROC, since twoClassSummary is used
  trControl = myControl
)
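As an aside, since I specifically want lasso, I think I could also fix alpha at 1 with a custom tuneGrid so that caret only tunes lambda. A sketch of what I have in mind (the lambda range is just a guess):

# Hypothetical grid: fix alpha = 1 (pure lasso) and tune lambda only
lasso_grid <- expand.grid(
  alpha  = 1,
  lambda = 10^seq(-4, 0, length.out = 50)
)
# passed to train() via tuneGrid = lasso_grid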
By printing the output of model, I can see that the best tune (alpha = 1.0, lambda = 0.01334) gives an ROC of 0.82. However, I don't know how to print the predictors that my lasso regression model uses to achieve this result; coef(model) just returns NULL. Can anyone help me with these two questions: are lasso regression and CV the best way to solve my problem, and how do I print the predictors selected by lasso?
Thanks for answering! One question about CV: do I have to make a train (70% of all samples) / test (30% of all samples) split, perform CV and train the model on the training set, and evaluate performance on the test set? Or do I have to do everything on the original dataset and calculate the model's predictive accuracy on the original dataset, including all the samples?
Yes, the idea is to perform CV on the training set and evaluate on the test set. For this to be reliable, some kind of stratified split is needed so that both datasets have the same proportion of the classification categories, and ideally the same distributions.
You don't have to use all of the data for training; generally that is only done when the dataset is very small and one can't afford to set aside any data for validation. In such cases leave-one-out CV is often performed instead of N-fold CV. Generally speaking, with smaller datasets a higher fold number (10+) tends to help with generalization, while with larger datasets 3- or 5-fold CV is usually sufficient.
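For example, a minimal sketch of a stratified split using caret's createDataPartition, which samples within each outcome class (the object names my_dataset and disease_status are taken from your example and may need adjusting):

set.seed(42)  # reproducible split
# createDataPartition samples within each class,
# so class proportions are preserved in both subsets
train_idx <- createDataPartition(my_dataset$disease_status, p = 0.8, list = FALSE)
train_set <- my_dataset[train_idx, ]
test_set  <- my_dataset[-train_idx, ]
# For very small datasets, leave-one-out CV can be requested with
# trainControl(method = "LOOCV") instead of method = "cv"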
There is no hard rule, but your intended split (70:30) is unusual; I think most people go for 75:25 or 80:20 splits.

I noticed that glmnet may be calling its lasso regularization factor lambda, while in the Python implementation that factor is called alpha. If so, the regularization value you got (0.01334) is probably fine.

It seems that your command gave NULL because you didn't specify for which lambda the coefficients should be shown. You may find this page helpful for how to print the lasso coefficients.
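For example, a sketch assuming model is the caret object from your train() call above (model$finalModel is the underlying glmnet fit):

# Coefficients at the cross-validated best lambda;
# predictors with a non-zero coefficient are the ones lasso kept
best_lambda <- model$bestTune$lambda
coef_mat <- as.matrix(coef(model$finalModel, s = best_lambda))
coef_mat[coef_mat[, 1] != 0, , drop = FALSE]  # only the selected genes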
Finally, if an answer was helpful and/or solved your problem, it is customary to upvote and/or accept it.