Question

How to identify best predictors in lasso-penalised model?

0


5.3 years ago

Raheleh ▴ 260

I have 66 samples (Relapse, Non-Relapse). I did a differential gene expression analysis using the limma package and then built a lasso-penalised model based on 45 DEGs, following the instructions prepared here.

These are the coefficients at min λ (lambda):

$`non-Relapse`
46 x 1 sparse Matrix of class "dgCMatrix"
                       1
(Intercept)   9.97551314
ASPH          .
ATP5A1        .
ATP5G3        .
COL24A1       0.05181047
DCAF10        .
DES           .
DNMT1         .
EIF2B2        .
ETS2          .
FLJ38773      .
FLNA         -0.72130732
HCAR1        -0.08396882
HK2           .
HMGB2         .
HNRNPDL       .
HSPA1A        0.05883314
HSPE1         .
ILF3          .
IQGAP3        .
IRF2BPL       .
JCHAIN        0.05657491
KPNA2         .
LOC102546294 -0.09527344
MAP7D2       -0.10777351
MED31         .
MSL1          .
OLFM4         .
PCK1          .
PHYKPL        .
PROSC         .
PTRF          .
REG3A         .
RNF7          .
SEC22B        .
SF1           .
SLC2A3       -0.15121214
SOX9          0.02238456
SRI           .
STRN3         .
TM2D1         .
TPT1          .
UQCRFS1       .
ZCCHC8        .
ZNF638        .
ZNF761        .


$Relapse
46 x 1 sparse Matrix of class "dgCMatrix"
                       1
(Intercept)  -9.97551314
ASPH          .
ATP5A1        .
ATP5G3        .
COL24A1      -0.05181047
DCAF10        .
DES           .
DNMT1         .
EIF2B2        .
ETS2          .
FLJ38773      .
FLNA          0.72130732
HCAR1         0.08396882
HK2           .
HMGB2         .
HNRNPDL       .
HSPA1A       -0.05883314
HSPE1         .
ILF3          .
IQGAP3        .
IRF2BPL       .
JCHAIN       -0.05657491
KPNA2         .
LOC102546294  0.09527344
MAP7D2        0.10777351
MED31         .
MSL1          .
OLFM4         .
PCK1          .
PHYKPL        .
PROSC         .
PTRF          .
REG3A         .
RNF7          .
SEC22B        .
SF1           .
SLC2A3        0.15121214
SOX9         -0.02238456
SRI           .
STRN3         .
TM2D1         .
TPT1          .
UQCRFS1       .
ZCCHC8        .
ZNF638        .
ZNF761        .

Sorry for the naive question; I am new to this field. Can anyone help me interpret the coefficients that I got for these genes? I studied this page but couldn't figure out what the dots mean. My second question is: which coefficients should I use to select the best predictors, those at min λ (lambda) or those at one standard error of λ? I really appreciate any help.

lasso-penalised Coefficient predictor • 1.3k views

updated 5.3 years ago by Jean-Karim Heriche 27k • written 5.3 years ago by Raheleh ▴ 260

score 1 · Answer 1 · 2019-09-16

1


5.3 years ago

Jean-Karim Heriche 27k

With glmnet, you can plot the coefficients as a function of regularization (i.e. coefficients vs. L1 norm), as explained in the package vignette. Interpreting this plot is key to inferring the most relevant features. A small L1 norm corresponds to strong regularization (an L1 norm of 0 corresponds to a model with no features). As regularization is relaxed, the L1 norm increases and more features are added to the model. More important features can therefore be seen as entering the model early, that is, their coefficients become non-zero at a low L1 norm.
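To illustrate why stronger features enter the model first, here is a minimal Python sketch (not glmnet itself). It assumes the idealised case of orthonormal, standardised predictors, in which each lasso coefficient is simply the soft-thresholded least-squares coefficient. The feature names and values are hypothetical.

```python
def soft_threshold(z, lam):
    """Lasso solution for one coefficient under an orthonormal design."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# Hypothetical least-squares coefficients for three standardised features.
ols = {"geneA": 0.9, "geneB": -0.4, "geneC": 0.1}

# Walk from strong regularisation (large lambda) to weak (small lambda).
lambdas = [1.0, 0.5, 0.2, 0.05]
path = []
for lam in lambdas:
    coefs = {g: soft_threshold(z, lam) for g, z in ols.items()}
    l1_norm = sum(abs(b) for b in coefs.values())
    active = [g for g, b in coefs.items() if b != 0.0]
    path.append((lam, l1_norm, active))
    print(f"lambda={lam:<5} L1 norm={l1_norm:.2f} active={active}")
```

In this toy setting the L1 norm grows as lambda shrinks, and the feature with the largest underlying effect (geneA) is the first to get a non-zero coefficient, followed by geneB and finally geneC.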
Since glmnet returns a set of models for a range of λ, cross-validation is used to find the best λ. Two values are flagged by default: λmin gives the minimum cross-validated error, and λ1se is the largest λ (i.e. the most regularized model) whose error is within one standard error of that minimum. Both are reasonable choices, but you can also pick another value depending on the problem.
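As a rough illustration of how the two flagged values relate, the Python sketch below applies the one-standard-error rule to a made-up cross-validation summary. The numbers are hypothetical and `pick_lambdas` is a helper invented for this example, not a glmnet function.

```python
# Hypothetical cross-validation summary, one entry per lambda, ordered from
# strongest to weakest regularisation.
lambdas = [0.50, 0.20, 0.10, 0.05, 0.01]
cv_mean = [0.70, 0.45, 0.40, 0.38, 0.44]  # mean CV error at each lambda
cv_se = [0.05, 0.04, 0.04, 0.04, 0.05]    # standard error of that mean

def pick_lambdas(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from a CV summary."""
    # lambda_min: the lambda with the lowest mean CV error.
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    # lambda_1se: the largest lambda whose mean error is still within one
    # standard error of the minimum error.
    threshold = cv_mean[i_min] + cv_se[i_min]
    lam_1se = max(lam for lam, err in zip(lambdas, cv_mean) if err <= threshold)
    return lambdas[i_min], lam_1se

lam_min, lam_1se = pick_lambdas(lambdas, cv_mean, cv_se)
print("lambda_min:", lam_min, "lambda_1se:", lam_1se)
```

With these made-up numbers, the minimum error is at λ = 0.05, while the one-standard-error rule picks the larger λ = 0.10, which corresponds to a more heavily penalised model and usually fewer non-zero coefficients.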

5.3 years ago by Jean-Karim Heriche 27k

0


Many thanks, Jean-Karim, for your explanation. I still don't understand the meaning of those dots. Could you please explain more? I am looking for the best predictor genes, i.e. those that can predict which patients are at risk of relapse. Based on the results I posted above, which genes are the best predictors? Many thanks!

5.3 years ago by Raheleh ▴ 260

1


The dots indicate features (here genes) that didn't make it into the model (i.e. coefficient is 0). The magnitude of the coefficient indicates the "strength" of the contribution of the corresponding gene. So a greater positive value means a stronger positive influence.
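As a concrete reading of the output in the question, the Python sketch below lists the genes with non-zero coefficients in the $Relapse block and ranks them by absolute value. The values are copied from the question. Treating magnitude as a measure of "strength" assumes the genes are measured on comparable scales, since glmnet reports coefficients on the original scale of the predictors.

```python
# Coefficients copied from the $Relapse block in the question (min lambda).
# Genes shown as "." in the output have coefficient 0 and are omitted here.
relapse_coefs = {
    "COL24A1": -0.05181047,
    "FLNA": 0.72130732,
    "HCAR1": 0.08396882,
    "HSPA1A": -0.05883314,
    "JCHAIN": -0.05657491,
    "LOC102546294": 0.09527344,
    "MAP7D2": 0.10777351,
    "SLC2A3": 0.15121214,
    "SOX9": -0.02238456,
}

# Rank by absolute value: the sign gives the direction, not the strength.
ranked = sorted(relapse_coefs.items(), key=lambda kv: abs(kv[1]), reverse=True)
for gene, beta in ranked:
    direction = "towards Relapse" if beta > 0 else "towards non-Relapse"
    print(f"{gene:<13} {beta:+.4f}  {direction}")
```

Only 9 of the 45 genes are retained at this λ. A negative coefficient in the Relapse block is not a weaker predictor than a positive one of the same size; it simply pushes the prediction towards the other class.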

5.3 years ago by Jean-Karim Heriche 27k

0


Thank you for your reply, but I am confused. You say "a greater positive value means a stronger positive influence."

If I'm not mistaken, you mean that a gene with a greater positive coefficient (in my case, for example, FLNA, SLC2A3 and MAP7D2, in that order, in the Relapse data) is a better predictor. But @Kevin mentioned in this post that "The best predictors will generally be those that have the smallest (possibly zero) coefficient values". Could you please help me out of this confusion? I really appreciate your time and help.

5.3 years ago by Raheleh ▴ 260

0


"a gene with a greater positive value of coefficient (for example in my case, FLNA and SLC2A3 and MAP7D2 respectively in relapse data) is a better predictor."

Yes

"The best predictors will generally be those that have the smallest (possibly zero) coefficient values"

On the face of it, this seems wrong, as a coefficient of 0 means the gene doesn't contribute to the model, but maybe this comment was made with something else in mind.

5.3 years ago by Jean-Karim Heriche 27k