Hi guys,
So this is more of a basic biostatistics question. I have, say, 100 samples, and for each I have a final biological outcome, like say presence or absence of diabetes. Now, for each of these 100 patients, I also have data for many different metabolites in the body, like glucose, fructose, etc. I want to know the effect/contribution of each of these variables on the biological outcome. In other words, the association of each of these variables with the outcome - so that I know which variable is contributing the most (and so on) to the outcome.
What is the best way to do this? If there were just one variable, then I think a simple regression would have worked. But in this case, with multiple variables, how do I do that?
If there are multiple methods to do this, I would love to know of all of them. I will read up on them! Thanks.
This can be addressed by building a machine learning model with built-in variable selection, such as a logistic regression model with Lasso (L1) regularization. In this model, you can treat the weight (coefficient) of each metabolite as a measure of its importance. Note, however, that a weight of 0 does not necessarily mean the feature is useless; it may be caused by redundant (correlated) features, where the model keeps one of them and drops the others.
Thanks a lot Shoujon! Could you recommend an implementation of the model you mentioned? Maybe in R or Python. Would love to play around with it!
You could check: https://scikit-learn.org/stable/. Not very familiar with machine learning libraries in R.
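For a starting point in Python, here is a minimal sketch of the Lasso-regularized logistic regression mentioned above, using scikit-learn. The data is synthetic (generated with make_classification) and the metabolite names are made up, since the real dataset is not shown in this thread; the value C=0.5 is also just an assumption and would need tuning on real data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): 100 patients,
# 20 metabolites, binary outcome (1 = diabetes, 0 = no diabetes).
# Replace X and y with your own measurements and labels.
X, y = make_classification(
    n_samples=100, n_features=20, n_informative=4, n_redundant=2,
    random_state=0,
)
names = [f"metabolite_{i}" for i in range(X.shape[1])]  # hypothetical names

# Standardize first so that the coefficients are on a comparable scale,
# then fit logistic regression with an L1 (Lasso) penalty.
# Smaller C means stronger regularization and more weights pushed to zero.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.5),
)
model.fit(X, y)

# One weight per metabolite; rank by absolute value, largest first.
coefs = model.named_steps["logisticregression"].coef_.ravel()
order = np.argsort(-np.abs(coefs))
ranked = [(names[i], coefs[i]) for i in order]

for name, weight in ranked:
    print(f"{name:15s} {weight:+.3f}")
```

Metabolites printed with a weight of 0.000 were dropped by the penalty, which, as noted above, can also happen to a useful feature that is redundant with another one. Rather than fixing C by hand, it is probably better to choose it by cross-validation, for example with LogisticRegressionCV from the same library.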