How to find the effect of different variables on a biological outcome?
5.2 years ago
Alex • 0

Hi guys,

So this is more of a basic biostatistics question. I have, say, 100 samples, and for each I have a final biological outcome, such as the presence or absence of diabetes. For each of these 100 patients, I also have measurements of many different metabolites in the body, like glucose, fructose, etc. I want to know the effect/contribution of each of these variables on the biological outcome - in other words, the correlation of each variable with the outcome - so that I know which variable contributes the most to the outcome.

What is the best way to do this? If there were just one variable, I think linear regression would work. But how do I do this with multiple variables?

If there are multiple methods to do this, I would love to know of all of them. I will read up on them! Thanks.

statistics correlation • 1.1k views

This can be addressed by building any machine learning model with a variable-selection feature, such as a logistic regression model with Lasso regularization. In such a model, you can treat each weight as the importance of the corresponding metabolite. Note that a weight of 0 does not necessarily mean the feature is useless; it may be caused by redundant (correlated) features...

Thanks a lot Shoujon! Could you recommend an implementation of the model you mentioned, maybe in R or Python? I would love to play around with it!

You could check https://scikit-learn.org/stable/. I am not very familiar with machine learning libraries in R.
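For example, a minimal sketch of the Lasso-regularized logistic regression suggested above, using scikit-learn. The data here is randomly generated just to make the snippet runnable; in practice `X` would hold your metabolite measurements and `y` the diabetes labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 100 patients x 20 metabolites, binary outcome
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = rng.integers(0, 2, size=100)

# Standardize so coefficient magnitudes are comparable across metabolites
X_scaled = StandardScaler().fit_transform(X)

# penalty="l1" is the Lasso penalty; liblinear is one of the solvers that supports it
model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
model.fit(X_scaled, y)

# Larger |coefficient| = larger estimated contribution; exact zeros are dropped features
for i, coef in enumerate(model.coef_[0]):
    print(f"metabolite_{i}: {coef:+.3f}")
```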

5.2 years ago
Mensur Dlakic ★ 28k

One hundred samples is not a lot to work with in terms of statistics, but it may end up being sufficient if your variables are informative.

One can assess whether features are informative about the final outcome by plotting histogram distributions of your features for both outcomes. Let's say our features are called 33 and 34, and we plot their histogram distributions for conditions 0 (no diabetes) and 1 (diabetes).

[Figure: overlaid histograms of features 33 and 34 for outcomes 0 (no diabetes) and 1 (diabetes)]

When the two distributions overlap completely or to a high degree, that feature will not be very useful. When the distributions are different, the feature is better at discriminating between the two outcomes. In this case feature 33 is more informative than 34. By studying the image one can quickly come up with a simple rule: when 33 is larger than 0, the final outcome is more likely to be 1; when 33 is smaller than 0, the outcome is more likely to be 0. No such strong rule is obvious for 34, though even there one can make out two weak rules: 1) it is slightly more likely that the outcome is 1 when feature 34 has extreme positive or negative values; 2) it is slightly more likely that the outcome is 0 when feature 34 is just below 0.
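As an illustration, here is one way such per-outcome histograms could be drawn with matplotlib. The feature values below are simulated so that 33 separates the outcomes well and 34 does not; substitute your own data:

```python
import matplotlib.pyplot as plt
import numpy as np

# Simulated data: feature 33 separates the two outcomes, feature 34 mostly overlaps
rng = np.random.default_rng(1)
features = {
    "33": (rng.normal(-1.0, 1.0, 50), rng.normal(1.0, 1.0, 50)),
    "34": (rng.normal(0.0, 1.0, 50), rng.normal(0.0, 1.2, 50)),
}

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, (name, (outcome0, outcome1)) in zip(axes, features.items()):
    ax.hist(outcome0, bins=20, alpha=0.5, label="0 (no diabetes)")
    ax.hist(outcome1, bins=20, alpha=0.5, label="1 (diabetes)")
    ax.set_title(f"feature {name}")
    ax.legend()
plt.show()
```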

All of this is for visualization purposes only - I am not suggesting that you come up with classification rules by eyeballing histograms. Any machine learning method that knows how to deal with feature importance (tree-based methods, L1-regularized regression) will extract feature importance automatically, though maybe not very accurately for only 100 samples.
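To make the automatic extraction concrete, here is a sketch with one of the tree-based methods mentioned above, a random forest in scikit-learn. The data is simulated so that one feature genuinely drives the outcome:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Simulated data: 100 samples, 20 features; the outcome depends mostly on feature 0
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X, y)

# Impurity-based importances sum to 1; rank features from most to least important
order = np.argsort(forest.feature_importances_)[::-1]
for i in order[:5]:
    print(f"feature {i}: importance {forest.feature_importances_[i]:.3f}")
```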

Hi Mensur, thanks a lot for the detailed response. Your explanation of how to find an informative feature made a lot of sense. I have two questions. First, some people have suggested that multiple regression would be the way to solve my problem (i.e., finding the contribution of the different variables to the outcome), so that I would know the % contribution of each variable to the variance in the outcome. Do you think the ML techniques you mentioned are better than multiple regression for this?

Secondly, can the feature importance you mentioned be directly interpreted as the % contribution of that feature to the outcome?

I think multiple regression is good as a starting point for your problem. Since you are dealing with relatively few samples - and even when one is not - I recommend that you go with regularized regression, as it deals better with the bias-variance trade-off. I suggest you read about it here and here, and it probably won't hurt to do some googling of your own on this subject. The gist of it is this: L1-regularized regression (aka Lasso) will tend to shrink the regression coefficients of some variables to exactly zero, meaning that it tries to eliminate some variables from consideration entirely; L2-regularized regression (aka Ridge) shrinks coefficients toward zero but keeps them all non-zero, so no variable is completely eliminated; elastic net is a mix of the two.
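A small sketch of the three penalties side by side, assuming scikit-learn (its saga solver supports all of them; the data is simulated purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
y = (X[:, 0] - X[:, 1] + rng.normal(size=100) > 0).astype(int)

# Lasso (l1) zeroes coefficients out, Ridge (l2) only shrinks them,
# and elastic net interpolates between the two via l1_ratio
for penalty, extra in [("l1", {}), ("l2", {}), ("elasticnet", {"l1_ratio": 0.5})]:
    model = LogisticRegression(penalty=penalty, solver="saga", max_iter=10000, **extra)
    model.fit(X, y)
    n_zero = int(np.sum(model.coef_[0] == 0.0))
    print(f"{penalty:>10}: {n_zero} of {X.shape[1]} coefficients exactly zero")
```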

Linear regression will definitely give you some idea about your variables, and it will give you a baseline in terms of modeling. It is impossible to know without trying whether non-linear methods (e.g., tree-based methods such as random forests or gradient-boosted trees) will give a more generalizable model, but they might.

In regression models it is not straightforward to come up with a percent contribution, because feature coefficients can be positive, zero, or negative. Still, if var1, var2 and var3 end up with coefficients -0.3, 0 and 1.7, respectively, it is obvious that the importances rank var3 > var1 > var2. Higher absolute values of regression coefficients signify that the associated features contribute more to the final outcome - provided the features are on comparable scales (e.g., standardized), since otherwise the coefficients are not directly comparable.
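If a percent-like number is still wanted, one crude heuristic (my own illustration, not a rigorous decomposition of variance) is to normalize the absolute coefficients of standardized features:

```python
import numpy as np

# Hypothetical coefficients from a regression on standardized features
names = np.array(["var1", "var2", "var3"])
coefs = np.array([-0.3, 0.0, 1.7])

# Crude importance: each feature's share of the total absolute coefficient mass
share = np.abs(coefs) / np.abs(coefs).sum() * 100
for name, coef, pct in sorted(zip(names, coefs, share), key=lambda t: -t[2]):
    print(f"{name}: coefficient {coef:+.2f} -> ~{pct:.0f}% of coefficient mass")
```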

Thank you SO MUCH for the detailed explanation. With your explanation, I am all set to explore this topic!
