Hi everyone, as with the output of most next-generation sequencing technologies, I have a large number of parameters but only a small sample size. (In my case I have computed scores for a large number of genes, so they are not exactly differential expression values, but they are still numerical variables.) I'm rather new to this area of statistics, and I'm facing the challenge of how to aggregate this large volume of data to make clinically meaningful inferences/predictions.
I now have on the order of thousands of parameters (~2000) and only 25 samples. The end goal is to build a model with fewer parameters (after parameter selection) that predicts nodal staging in cancer, so the outcome can be treated either as a numeric (strictly speaking, ordinal) variable (e.g. stage 0 to stage 3) or as a binary variable (e.g. > stage 2 or not). From what I gather after browsing various threads here and doing a bit of research, some methods to approach this problem are:
1) Univariate regression for each parameter first, then selecting the significant parameters to put into a multivariate regression model (from "Performing univariate and multivariate logistic regression in gene expression data")
2) Stepwise linear/logistic regression (from "Building a predictive model by a list of genes"), which is probably less tedious than manually running all the univariate regressions
3) Lasso or elastic-net regression (from "How to exclude some of breast cancer subtypes just by looking at gene expression?") to perform parameter selection as well as model fitting
4) Random forest regression
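To make approach 1 concrete for myself, here is a minimal sketch in plain Python of the screening step as I understand it. It ranks features by absolute Pearson correlation with the outcome, which for a single predictor gives the same ordering as the p-value of a univariate linear regression. The data are made up (25 samples, 2000 random scores, with one informative feature planted at index 0), and the names pearson_r, screen_features and top_k are my own, not from any package.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no information
    return sxy / math.sqrt(sxx * syy)


def screen_features(X, y, top_k):
    """Rank the columns of X by |correlation| with y; return the top_k column indices."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in X]
        scores.append((abs(pearson_r(column, y)), j))
    scores.sort(reverse=True)  # strongest association first
    return [j for _, j in scores[:top_k]]


# Toy data (entirely made up): 25 samples, 2000 random gene scores.
random.seed(0)
n_samples, n_features = 25, 2000
y = [i % 4 for i in range(n_samples)]  # hypothetical nodal stage 0..3
X = [[random.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
for i in range(n_samples):
    X[i][0] = y[i] + random.gauss(0, 0.1)  # plant one truly informative feature

selected = screen_features(X, y, top_k=10)
print(selected)
```

On this toy data the planted feature ranks first, but the other nine selected features are pure noise that happens to correlate with the outcome by chance, which is part of why I'm unsure about this approach with only 25 samples.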
So my main questions now are:
- As I'm unfamiliar with the underlying math/statistics, is there a guide or rule of thumb for which approach is preferable, or are there conditions that help decide which approach to use?
- For lasso regression/random forest, the tutorials I have read usually use a training set and a testing set. Given my low sample size, can I put all observations into the training set, or must I still hold out a few observations as the testing set?
- For elastic-net regression, how do I optimize the alpha parameter (the mixing between the ridge and lasso penalties, where alpha = 1 is the lasso) using the glmnet package? Most tutorials only cover how to optimize the lambda parameter.
- Is my understanding correct that random forest cannot perform parameter selection to reduce the number of explanatory variables (i.e. it keeps all the parameters I input and infers missing values when necessary, but won't remove irrelevant parameters the way lasso regression would)?
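For the training/testing question, the alternative I have seen mentioned to a single held-out test set is leave-one-out cross-validation, where each sample is held out once and the model is refit on the remaining 24. Below is a rough plain-Python sketch of that idea, assuming a binary outcome. It is not lasso or random forest: as a stand-in I use correlation-based selection plus a nearest-centroid classifier, and the data, the function names (abs_corr, loocv_accuracy) and top_k are all made up for illustration. The point I am trying to capture is that the feature selection is repeated inside every fold using only the training samples.

```python
import math
import random


def abs_corr(x, y):
    """Absolute Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant column: treat as uninformative
    return abs(sxy / math.sqrt(sxx * syy))


def loocv_accuracy(X, y, top_k):
    """Leave-one-out CV: each sample is held out once; selection is redone per fold."""
    n, n_features = len(X), len(X[0])
    correct = 0
    for held_out in range(n):
        train = [i for i in range(n) if i != held_out]
        y_train = [y[i] for i in train]

        # Feature selection sees the training samples only, never the held-out one.
        scores = sorted(
            ((abs_corr([X[i][j] for i in train], y_train), j) for j in range(n_features)),
            reverse=True,
        )
        keep = [j for _, j in scores[:top_k]]

        # Stand-in model: nearest class centroid on the selected features.
        centroids = {}
        for label in set(y_train):
            rows = [X[i] for i in train if y[i] == label]
            centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in keep]

        x_new = [X[held_out][j] for j in keep]
        pred = min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(x_new, centroids[lab])),
        )
        if pred == y[held_out]:
            correct += 1
    return correct / n


# Toy data (entirely made up): 25 samples, 300 features to keep the run short.
random.seed(1)
n_samples, n_features = 25, 300
y = [0] * 12 + [1] * 13  # hypothetical binary outcome, e.g. "> stage 2 or not"
X = [[random.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
for i in range(n_samples):
    X[i][0] = 5 * y[i] + random.gauss(0, 0.5)  # one planted informative feature

acc = loocv_accuracy(X, y, top_k=3)
print(acc)
```

If the selection were done once on all 25 samples before the loop, the held-out sample would already have influenced which features were chosen, and the accuracy estimate would be too optimistic.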