Hi,
I'm trying to use lasso regression to build a model for a classification problem (predicting disease status). My dataset contains more than 16,000 variables (RNA transcripts from genes), but I'd like to find only 5 to 10 genes that best predict disease status. However, with the code from this source: http://www.sthda.com/english/articles/36-classification-methods-essentials/149-penalized-logistic-regression-essentials-in-r-ridge-lasso-and-elastic-net/#compute-lasso-regression, I can't see a way to fix the number of variables the model uses. I'm also not sure whether lasso regression is suitable for this purpose. This is the code I'm currently using:
library(glmnet)  # provides cv.glmnet
# Divide the dataset into training and testing samples
traindata<-dataset[trainingsamples,]
testdata<-dataset[-trainingsamples,]
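# Predictor matrix without the intercept column; the outcome column is patient.group
x<-model.matrix(patient.group~., traindata)[,-1]
# Binary outcome: 1 = disease, 0 = otherwise
y<-ifelse(traindata$patient.group=="disease",1,0)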
# Find the best lambda using cross-validation
set.seed(123)
cv.lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial")
# Find which variables have nonzero coefficients at lambda.1se
tmp_coeffs <- coef(cv.lasso, s = "lambda.1se")
data.frame(name = tmp_coeffs@Dimnames[[1]][tmp_coeffs@i + 1], coefficient = tmp_coeffs@x)
Right now the model selects 42 genes to predict disease status, which gives good accuracy. Does anyone know how to reduce the number of variables being used? Or do I have to use another machine learning strategy for that?
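One idea I have not verified on my real data: cv.glmnet fits a whole path of lambda values and reports the number of nonzero coefficients at each one in `nzero`, so it may be possible to pick, among the lambdas that keep at most 10 genes, the one with the lowest cross-validated error. Below is a minimal sketch on simulated data. The matrix dimensions, the gene names, and the names `max_genes`, `lambda_k` and `selected` are made up for illustration and do not come from my dataset.

```r
library(glmnet)

# Simulated stand-in for the expression data (hypothetical: 100 samples, 200 genes)
set.seed(123)
n <- 100
p <- 200
x <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("gene", seq_len(p))))
# Only the first three simulated genes carry signal
y <- rbinom(n, 1, plogis(2 * x[, 1] - 2 * x[, 2] + 1.5 * x[, 3]))

cv.lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial")

max_genes <- 10  # assumed upper limit on the number of selected genes

# nzero gives the number of nonzero coefficients at each lambda on the path
ok <- which(cv.lasso$nzero > 0 & cv.lasso$nzero <= max_genes)

# Among those lambdas, take the one with the lowest cross-validated deviance
lambda_k <- cv.lasso$lambda[ok[which.min(cv.lasso$cvm[ok])]]

# Extract the names of the genes with nonzero coefficients at that lambda
cf <- as.matrix(coef(cv.lasso, s = lambda_k))[, 1]
selected <- setdiff(names(cf)[cf != 0], "(Intercept)")
print(selected)
```

If this is a sensible approach, I assume the accuracy of the smaller model would still have to be checked on testdata, since this lambda is chosen by a gene-count limit rather than by lambda.min or lambda.1se.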
Thanks!