It's my first time trying to implement the elastic net algorithm on microarray data, and I'm facing some issues.
My dataset DATA is a data frame consisting of 72 observations (rows) and 18703 predictors (columns). The class labels are stored as the 18704th column of the data frame, and the observations come in replicates, like the following.
DATA["classes"] = c( rep("A",10),
rep("B",13),
rep("C",3),
rep("D",10),
rep("E",10),
rep("F",10),
rep("G",10),
rep("H",3),
rep("I",3))
The way I'm trying to run the elastic net is the following:
# Random sampling: 70% for training, 30% for validation.
# Resample until both sets contain all 9 classes.
y = z = 0
while(y != 9 || z != 9){
  idx   = sample(x = 1:nrow(DATA), size = floor(0.7 * nrow(DATA)))
  train = DATA[idx,]
  test  = DATA[-idx,]
  y = length(unique(train$classes))
  z = length(unique(test$classes))
}
# Execute the glmnet CV
library(glmnet)
y = as.factor(train$classes)
cvfit = cv.glmnet(train[,-18704], y, family="multinomial", alpha=0.9)
Now I'm getting the following error, and I can't work out what it actually means.
Error in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, :
(list) object cannot be coerced to type 'double'
In addition: Warning message:
In lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, :
one multinomial or binomial class has fewer than 8 observations; dangerous ground
Is this because the replicate counts are uneven (some classes have 10 samples and others 3), or is it something different?
Also, while reading the documentation of cv.glmnet, I couldn't understand what the nfolds argument stands for. If anyone can explain this, it would be very helpful.
Thanks in advance.
Thank you very much for the explanation. Very informative and clear. I'm going to try it out.
Although there is one thing you said that I don't understand:
If you create chunks equal to the total number of samples in your training set (e.g. 10), which sample will be the one that is left out? I mean, you have 10 samples and you use all of them to train the algorithm. Why do we call it leave-one-out?
Hi arronar, when you choose nfolds as the total number of samples, which as I mentioned is 'leave-one-out' cross-validation, your dataset is 'randomly' divided into chunks, each chunk containing a single sample. If you have 100 samples, 99 of them will be used as the training data and just 1 sample will be held out as the validation. This is then repeated 100 times, each time with a different configuration of training and validation samples.
Some people don't like this type of cross-validation precisely because only 1 sample is used for validation. Most people do 3-fold.
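To make nfolds concrete, here's a minimal sketch (x and y standing in for a numeric predictor matrix and a factor of class labels):
library(glmnet)
# nfolds controls how many chunks the data is split into for cross-validation:
cv3 = cv.glmnet(x, y, family = "multinomial", alpha = 0.9, nfolds = 3)
# Setting nfolds to the number of samples gives leave-one-out CV; the glmnet
# documentation allows nfolds as large as the sample size (it may warn and
# switch to ungrouped CV when the folds get this small):
cvloo = cv.glmnet(x, y, family = "multinomial", alpha = 0.9, nfolds = nrow(x))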
Your next question may be about how the program 'randomly' chooses the samples for training and validation. R, which is primarily written in C and Fortran, uses a pseudo-random number generator (the Mersenne Twister, by default) to produce what I would call 'virtually random' numbers, and these are used to break up your dataset. The key point is that you can control how R breaks up the dataset by setting a fixed 'seed' value with the set.seed() function, e.g., set.seed(123456), prior to running cv.glmnet. Then, each time you run the function with the chosen seed, the dataset will be divided in the same way.
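For example (a minimal sketch, with x and y as above):
set.seed(123456)   # fix the RNG state so the folds are reproducible
fit1 = cv.glmnet(x, y, family = "multinomial", alpha = 0.9, nfolds = 3)
set.seed(123456)   # same seed, so the folds are assigned identically
fit2 = cv.glmnet(x, y, family = "multinomial", alpha = 0.9, nfolds = 3)
identical(fit1$lambda.min, fit2$lambda.min)   # TRUE: both runs used the same splits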
Thank you once more. You are very instructive.
Absolutely no problem. Any more questions, just ask :)
So, now that I've found you, let me ask one last thing, about weights. Is it reasonable to assign weights to your samples if you don't know their differences beforehand? In my case, I cannot tell whether sample A is more important than sample H, and thus I cannot set specific weight values for them. Am I right, or am I missing something?
Also, even with the changes I've made, the algorithm doesn't seem to work properly.
I changed the classes into numbers and used nfolds = 3.
I've tried many different nfolds values, but I'm still getting the same error :S
Hey again arronar,
Just regarding the new error that you're getting. It at least allows us to infer that the data formats are okay, because each has passed the basic checks of the function. Also, I have read up on the behaviour of the function and the y variable is fine as either numeric or factor, whilst x does need to be converted to a data-matrix.
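In other words, something along these lines should get past that coercion error (a sketch reusing the train object from your original post):
library(glmnet)
x = as.matrix(train[,-18704])   # glmnet needs a numeric matrix, not a data frame
y = as.factor(train$classes)
cvfit = cv.glmnet(x, y, family = "multinomial", alpha = 0.9)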
The error 'one multinomial or binomial class has 1 or 0 observations; not allowed' seems to be related to your factor level 13, of which there is only 1 sample. If you remove that factor level, then it may run. However, there are additional warning messages that relate to the fact that all of your factor levels have low numbers.
Is this just training data, like, for practice, or is it for a project that could be published? If you really do just have 9 samples, then it will be difficult to extrapolate meaningful stats from this. As an example, the last time that I ran a lasso regression, I had ~450 samples and ~500 variables. The lasso regression has also been used in breast cancer to identify mutation signatures, on datasets of >1000 samples and >100,000 mutations.
To answer your question on the weights: in glmnet, the weights argument actually applies to the observations, but there is a separate penalty.factor argument that lets you weight the variables instead. As a practical example, if you were looking at gene expression in a specific tissue, you could weight each gene by how specifically it is expressed in the tissue of interest. Some genes are expressed at very high levels in some tissues, and not at all in others.
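A sketch of what per-variable weighting could look like (tissue_specific_idx is a hypothetical index vector of favoured genes, not something taken from your data):
pf = rep(1, ncol(x))             # default penalty multiplier of 1 for every gene
pf[tissue_specific_idx] = 0.5    # hypothetical: halve the penalty on favoured genes
cvfit = cv.glmnet(x, y, family = "multinomial", alpha = 0.9, penalty.factor = pf)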
Hope that this helps some more!
Hmm, one of us has misunderstood something. I don't think I have a factor level 13 with only one sample. The y variable consists of the following:
So, to answer you: my total number of samples, with their replicates, is 62, not 9. Also, as you can see, there is no factor level that has only one sample.
I'm just playing with these data for now. I don't know what future will bring.
I'm tired of just thanking you for your informative answers, so instead let me treat you to a beer :-)
Apologies, I misinterpreted your code regarding the y variable! However, is it 62 samples or 72? Your y vector contains 72 elements. Also, the factor levels with just 3 samples are still going to throw (produce) the warning message - you may have to remove them in order to obtain reliable results from the cross-validation.
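For instance, one way to drop the small classes before refitting (a sketch; the threshold of 8 just mirrors the cut-off in the warning you saw):
# Keep only the classes with at least 8 replicates, then drop the unused levels.
big_enough = names(which(table(DATA$classes) >= 8))
DATA_sub = DATA[DATA$classes %in% big_enough, ]
DATA_sub$classes = droplevels(as.factor(DATA_sub$classes))
table(DATA_sub$classes)   # check what survives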
Finally, the error message may relate to your variables being too 'flat' (lacking variation) in some of the factor levels, e.g., in the factor levels with just 3 samples. In your data-matrix, are there many missing or 0 values, or any values that are the same across all samples (or even just across all samples within a factor level)?
If you run the hist() function on your data-matrix, does the distribution look normal?
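A few quick checks along those lines (x being the numeric data-matrix):
sum(is.na(x))                # how many missing values?
mean(x == 0)                 # what proportion of values are exactly 0?
sum(apply(x, 2, var) == 0)   # how many variables are constant across all samples?
hist(as.vector(x))           # overall distribution of the values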
Sure, you can send the beer to Ireland!
Hey!
You were right. After deleting the samples with the small replicate numbers, the algorithm ran as expected. I also tried to create a matrix of the most significant variables from the sparse matrix returned by the coef() function. Take a look and tell me your opinion.
Fantastic - really happy to have helped out.
The procedure for picking the significant genes is right, isn't it?
I'm happy too :-). At eleven tomorrow morning you are going to receive a package with beers from DHL at your office :-)
Yes, I believe so: you have to look for the variables whose coefficients have not been 'shrunk' to 0.
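For a multinomial fit, a sketch of pulling those out (cvfit being the fitted cv.glmnet object; coef() returns one sparse coefficient matrix per class):
cf = coef(cvfit, s = "lambda.min")   # list of sparse matrices, one per class
selected = lapply(cf, function(m) {
  m = as.matrix(m)                   # densify the single coefficient column
  setdiff(rownames(m)[m[, 1] != 0], "(Intercept)")
})
unique(unlist(selected))             # genes retained in at least one class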
After I ran my elastic net model, I performed the following code:
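As a rough sketch of this step, assuming a binary outcome y (the Wald test and ROC curve below only make sense for a two-class problem) and a numeric predictor matrix x:
library(glmnet)
set.seed(123456)
cvfit = cv.glmnet(x, y, family = "binomial", alpha = 0.9)
coef(cvfit, s = "lambda.min")   # inspect which coefficients survived the shrinkage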
I then look at the coefficients and pick the best variable predictors. After that, I build a GLM with the best predictors:
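For example (gene1, gene2, gene3 being hypothetical stand-ins for whichever predictors survived the shrinkage):
df = data.frame(y = y, as.data.frame(x[, c("gene1", "gene2", "gene3")]))
fit = glm(y ~ ., data = df, family = binomial)
summary(fit)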
I then perform Wald's test to get a single P value for the model:
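A sketch using the wald.test() function from the aod package (the choice of package is an assumption on my part):
library(aod)   # provides wald.test()
# Jointly test all predictor coefficients (term 1 is the intercept):
wald.test(b = coef(fit), Sigma = vcov(fit), Terms = 2:length(coef(fit)))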
...and then I produce a ROC curve:
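And a sketch of the ROC step using the pROC package (again, an assumed choice of package):
library(pROC)   # provides roc() and auc()
prob = predict(fit, type = "response")   # fitted probabilities from the GLM
roc_obj = roc(df$y, prob)
plot(roc_obj)
auc(roc_obj)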
Great! I will be expecting some lovely Austrian beer tomorrow!
Thank you very very much!
Have a nice rest of the day!