Hello everyone.
I am running several survival analyses on TCGA data. One of my goals is to identify genes associated with survival in the KIRP data. To do so, I first evaluated each candidate gene in a univariate Cox model with the coxph function from the survival R package. I then selected the genes significantly associated with survival (P <= 0.05) and fitted a multivariate model with the same function, which included about 30 genes.
However, I would like to determine the best subset of genes for this model. I have tried two approaches: AIC (Akaike Information Criterion) and the glmnet package (penalized maximum likelihood). For the AIC approach, I used the stepAIC function from the MASS package, which performs a stepwise search for the best model using backward and forward selection. For the glmnet approach, I used cross-validation to choose the value of lambda that gives the minimum mean cross-validated error, and kept the genes whose coefficients are non-zero at that value.
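For reference, the workflow described above can be sketched roughly as follows. This is a minimal illustration on simulated data, not the actual KIRP analysis: the object names (expr, dat, selected), the simulation settings, and the choice to run both selection methods on the screened genes are all assumptions made for the example.

```r
library(survival)
library(MASS)
library(glmnet)

set.seed(1)

# Simulated stand-in for the expression data (hypothetical, not TCGA):
# n patients, p candidate genes.
n <- 200
p <- 20
expr <- matrix(rnorm(n * p), nrow = n,
               dimnames = list(NULL, paste0("gene", seq_len(p))))

# In this toy example only the first three genes affect the hazard.
lin_pred <- 0.8 * expr[, 1] - 0.6 * expr[, 2] + 0.5 * expr[, 3]
event_time <- rexp(n, rate = exp(lin_pred))
cens_time <- rexp(n, rate = 0.3)
dat <- data.frame(time = pmin(event_time, cens_time),
                  status = as.integer(event_time <= cens_time),
                  expr)

# Step 1: one univariate Cox model per gene, keeping the Wald p-value.
genes <- colnames(expr)
uni_p <- sapply(genes, function(g) {
  fit <- coxph(as.formula(paste("Surv(time, status) ~", g)), data = dat)
  summary(fit)$coefficients[1, "Pr(>|z|)"]
})
selected <- genes[uni_p <= 0.05]

# Step 2a: multivariate Cox model on the screened genes, then stepwise AIC.
# Without a scope argument, stepAIC searches within the starting model.
full_formula <- as.formula(
  paste("Surv(time, status) ~", paste(selected, collapse = " + "))
)
full_fit <- coxph(full_formula, data = dat)
step_fit <- stepAIC(full_fit, direction = "both", trace = FALSE)
aic_genes <- names(coef(step_fit))

# Step 2b: lasso-penalized Cox model, lambda chosen by cross-validation.
# The two-column (time, status) response matrix is accepted by glmnet
# for family = "cox".
x <- as.matrix(dat[, selected, drop = FALSE])
y <- cbind(time = dat$time, status = dat$status)
cv_fit <- cv.glmnet(x, y, family = "cox", alpha = 1, nfolds = 10)

# Genes with non-zero coefficients at the lambda minimizing the CV error.
coef_vec <- as.matrix(coef(cv_fit, s = "lambda.min"))[, 1]
lasso_genes <- names(coef_vec)[coef_vec != 0]

print(aic_genes)
print(lasso_genes)
```

In this sketch the same screened gene set feeds both methods, so the two selected subsets can be compared directly; cv.glmnet also stores lambda.1se, which gives a sparser model than lambda.min.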
I would like to know whether these approaches are sound or whether I should use a different one. If so, which method would you use to select the variables for the final model? Thank you in advance.