Hello,
I have used cv.glmnet to develop a model on my training set. However, since my dataset contains only 205 samples, an 80%/20% partition leaves a very small sample size for testing. I would like to evaluate my final model on the whole dataset rather than on just the test set.
I used this code:
dput(y_train[1:5])
c(`GTEX-14LZ3` = 1.30204166535137, `GTEX-13QJC` = 0.767120841644941,
`GTEX-11DZ1` = 0.281033091212483, `GTEX-14C5O` = -0.589082784743255,
`GTEX-1F7RK` = -0.210001426831818)
dput(x_train[1:5,1:5])
structure(c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), dim = c(5L,
5L), dimnames = list(c("GTEX-14LZ3", "GTEX-13QJC", "GTEX-11DZ1",
"GTEX-14C5O", "GTEX-1F7RK"), c("22_45255830_C_T_b38", "22_45255847_C_T_b38",
"22_45256000_C_T_b38", "22_45256064_A_G_b38", "22_45256248_A_G_b38"
)))
library(glmnet)

# cis_gt, expr_resid, n_folds and alpha are defined earlier in the script.
set.seed(15)
# Randomly select 140 of the 205 samples for training; the rest are held out for testing.
indexes <- sample(dim(cis_gt)[1], 140, replace = FALSE)
x_train <- cis_gt[indexes, ]
y_train <- expr_resid[indexes, ]
x_test <- cis_gt[-indexes, ]
y_test <- expr_resid[-indexes, ]
# Inner-loop - split up training set for cross-validation to choose lambda.
# Fit model with training data.
set.seed(20)
fit <- cv.glmnet(x_train, y_train, nfolds = n_folds, alpha = alpha, type.measure='mse')
# Predict test data using model that had minimal mean-squared error in cross validation
y_pred <- predict(fit, x_test, s = 'lambda.min')
cor(y_pred,y_test,method="spearman")
I wanted to ask whether I can do this, since 80% of the dataset was already used during training. I basically want to do these steps (see the code sketch after the list):
1. Randomly split the data into 5 folds.
2. For each fold:
a. Remove the fold from the data.
b. Use the remaining data to train an elastic-net model, using 10-fold cross-validation to tune the lambda parameter.
c. With the trained model, predict on the held-out fold and compute various statistics for how the model performs.
3. Calculate the average and standard deviation of each of these statistics, where applicable. This should provide a reasonable estimate of how well the model will generalize to new data.
4. Train a new elastic-net model using all of the data, again using 10-fold cross-validation to tune the lambda parameter.
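To make this concrete, here is a rough sketch of the outer loop I have in mind, written against the objects above (cis_gt and expr_resid, which I am treating as a single-column matrix, since its rows drop to a numeric vector as in the dput output). The fold assignment via sample(), the alpha value, and Spearman correlation as the performance statistic are just placeholders for whatever I actually end up using:

library(glmnet)

alpha <- 0.5     # placeholder for the elastic-net mixing parameter I actually use
n_outer <- 5

set.seed(15)
# Assign every sample to one of 5 outer folds of roughly equal size
outer_fold <- sample(rep(1:n_outer, length.out = nrow(cis_gt)))

spearman_per_fold <- numeric(n_outer)
for (k in 1:n_outer) {
  train_idx <- which(outer_fold != k)
  test_idx  <- which(outer_fold == k)

  x_tr <- cis_gt[train_idx, , drop = FALSE]
  y_tr <- expr_resid[train_idx, ]
  x_te <- cis_gt[test_idx, , drop = FALSE]
  y_te <- expr_resid[test_idx, ]

  # Inner 10-fold CV on the training part only, to tune lambda
  cvfit <- cv.glmnet(x_tr, y_tr, nfolds = 10, alpha = alpha, type.measure = "mse")
  y_hat <- predict(cvfit, x_te, s = "lambda.min")

  spearman_per_fold[k] <- cor(as.vector(y_hat), y_te, method = "spearman")
}

mean(spearman_per_fold)  # average performance over the outer folds
sd(spearman_per_fold)    # and its spread

# Final model: refit on ALL the data, again tuning lambda with 10-fold CV
y_all <- drop(expr_resid)  # collapse the single-column matrix to a vector, matching y_train above
final_fit <- cv.glmnet(cis_gt, y_all, nfolds = 10, alpha = alpha, type.measure = "mse")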
I am not able to understand how to do these outer folds using cv.glmnet. Could anyone provide some guidance? I thought cv.glmnet's nfolds option does this?
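For reference, my current understanding of the relevant arguments is below (alpha = 0.5 is again just a placeholder); please correct me if I have this wrong:

# nfolds only controls the internal cross-validation that cv.glmnet runs to
# pick lambda; it does not create an outer train/test split for me.
fit <- cv.glmnet(x_train, y_train, nfolds = 10, alpha = 0.5, type.measure = "mse")

# foldid lets me supply my own fold assignments for that internal CV instead,
# e.g. to keep the inner folds reproducible:
foldid <- sample(rep(1:10, length.out = length(y_train)))
fit2 <- cv.glmnet(x_train, y_train, foldid = foldid, alpha = 0.5, type.measure = "mse")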