Lasso-penalised model and glm to identify the best predictor genes
2.4 years ago
Georges • 0

I have RNA-seq data with multiple variables. Using DESeq2 I wasn't able to identify DEGs for the variable of interest (Treatment). The PCA revealed that gene expression seems to be driven by two factor variables (Sex and Tissue). Therefore I decided to build a lasso-penalised model followed by a glm regression analysis (Treatment ~ 'selected genes' + Sex + Tissue) using the best predictors from the lasso model. The lasso and glm code was adapted from a Kevin Blighe post. I'm not sure whether the performance of the model is compromised by the huge number of genes (~40k), given that no DEGs were found in the first place. Also, why do all selected genes have Pr(>|z|) of 1? I would like to ask the experts here for advice: is it at all reasonable to do the analysis this way? Many thanks

I started by generating variance-stabilised counts as follows:

library("DESeq2")
library("stats")
dds_raw <- DESeqDataSetFromMatrix(countData = Counts,
                                colData = Meta_data , 
                                design = ~ 1)

 keep <- rowSums(counts(dds_raw)) >= 10
 dds_raw <- dds_raw[keep,]
 dds_Diff <- DESeq(dds_raw)
 vsd_data <- vst(dds_Diff )
 mat  <- assay(vsd_data)
 mat_df <- as.data.frame(t(mat ))
 mat_df$SampleID <- rownames(mat_df)
 DATA_Model <- join(mat_df, Meta_data ,  by= 'SampleID' , type = 'left' , match='all')

##### lasso model
library("glmnet")
library("doParallel")  # needed for parallel = TRUE below

# keep only the gene columns (drop all metadata columns, including SampleID)
DATA_Model_lasso <- DATA_Model[, -which(names(DATA_Model) %in% colnames(Meta_data))]

lassoModel <- glmnet(x = data.matrix(DATA_Model_lasso),
                     y = DATA_Model$Treatment,
                     standardize = TRUE, alpha = 1, family = "binomial")

# perform 10-fold cross-validation
registerDoParallel(4)  # register a parallel backend first; adjust the core count
cv.lassoModel <- cv.glmnet(x = data.matrix(DATA_Model_lasso),
                           y = DATA_Model$Treatment,
                           standardize = TRUE, alpha = 1, nfolds = 10,
                           family = "binomial",  # use "multinomial" for >2 classes
                           parallel = TRUE)

# identify the best lambda values
idealLambda <- cv.lassoModel$lambda.min
idealLambda1se <- cv.lassoModel$lambda.1se

# derive coefficients for each gene
co <- coef(cv.lassoModel, s = idealLambda, exact = TRUE)
co.se <- coef(cv.lassoModel, s = idealLambda1se, exact = TRUE)

######## identify the predictors selected at lambda.1se
predict_nonzero <- predict(cv.lassoModel, type = "nonzero", s = idealLambda1se)
# the index column is named "s1" or "X1" depending on the glmnet version,
# so take the first element by position rather than by name
predict_nonzero_matrix <- DATA_Model_lasso[, predict_nonzero[[1]]]
Genes_GLM_ready <- colnames(predict_nonzero_matrix)

## build a glm on the best predictors, using Sex and Tissue as additional covariates
formula_full_lasso <- paste('Treatment ~',
                            paste(Genes_GLM_ready, collapse = ' + '),
                            '+ Sex + Tissue')

finalLasso <- glm(as.formula(formula_full_lasso),
                  data = DATA_Model,
                  family = binomial(link = "logit"))

summary(finalLasso)

Output:

[screenshot: summary(finalLasso) coefficient table; every gene has a huge Std. Error (on the order of 1e6), z value ~0, and Pr(>|z|) = 1]

(Dispersion parameter for binomial family taken to be 1)

 Null deviance: 9.2139e+01  on 67  degrees of freedom
 Residual deviance: 3.9451e-10  on  0  degrees of freedom
 AIC: 136

 Number of Fisher Scoring iterations: 25
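For reference, the residual deviance is essentially zero on 0 degrees of freedom, i.e. the model fits the data perfectly with as many parameters as there are samples. A quick check of that, using the objects above:

length(coef(finalLasso))  # number of fitted parameters
nrow(DATA_Model)          # number of samples; equal counts mean a saturated, uninformative fit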
lasso-penalised glm DESeq2 model R
Kevin Blighe:

I think that you still have too many model parameters (i.e. genes) in your final model, which results in the model failing to fit; there may be other issues as well. However, I would need to look at various things in order to correctly diagnose the problem.

The lasso is used first to greatly reduce the number of model parameters to roughly 5-10; these are then used in a final model together with the covariates.
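A minimal sketch of that reduction step, assuming the cv.lassoModel and DATA_Model objects from the question (the cap of 5 genes is an arbitrary illustration, not a recommendation for these data):

# rank the lambda.1se-selected genes by absolute coefficient size
co.se <- coef(cv.lassoModel, s = "lambda.1se")
betas <- setNames(as.vector(co.se[, 1]), rownames(co.se))[-1]  # drop the intercept
betas <- betas[betas != 0]
top_genes <- names(sort(abs(betas), decreasing = TRUE))[seq_len(min(5, length(betas)))]

# final model: a handful of genes plus the known covariates
f <- as.formula(paste('Treatment ~', paste(top_genes, collapse = ' + '), '+ Sex + Tissue'))
reducedLasso <- glm(f, data = DATA_Model, family = binomial(link = "logit"))
summary(reducedLasso)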

Georges:

Thanks Kevin. The lasso yielded 25 genes with alpha = 1. If I use up to 4 of those 25 genes in the subsequent glm, I get the following results, which look just fine.

[screenshots: summary() output of glms fitted with up to 4 of the selected genes, showing sensible standard errors and p-values]

What do you think about running a glm on each of the 25 genes separately? Does that make sense?

I don't know how to share the counts file here to reproduce the problem.
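To make the per-gene idea concrete, here is a minimal sketch of fitting one glm per selected gene (with Sex and Tissue as covariates) and collecting the Wald p-values; it assumes the Genes_GLM_ready and DATA_Model objects from above, and that the gene names are syntactically valid R names:

# one logistic model per lasso-selected gene, adjusting for Sex and Tissue
per_gene <- do.call(rbind, lapply(Genes_GLM_ready, function(g) {
  f   <- as.formula(paste('Treatment ~', g, '+ Sex + Tissue'))
  fit <- glm(f, data = DATA_Model, family = binomial(link = "logit"))
  cf  <- summary(fit)$coefficients
  data.frame(gene = g, estimate = cf[g, "Estimate"], p = cf[g, "Pr(>|z|)"])
}))
per_gene$p.adj <- p.adjust(per_gene$p, method = "BH")  # correct for the 25 tests
per_gene[order(per_gene$p), ]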

2.4 years ago
LChart 4.7k

Your penalization parameter is not nearly strong enough. One of the reasons to use LASSO is its ability to shrink coefficients to 0, and this is not happening for you. The presence of standard errors on the order of 10^6 (which is where Z=0/P=1 comes from) basically means you have more non-zero coefficients than you have samples, so the effective condition number is blowing up. You need to tune your lambda up much, much higher to start selecting small models.
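A minimal sketch of inspecting the regularisation path to pick a stronger lambda, assuming the cv.lassoModel object from the question (the cap of 5 non-zero genes is an arbitrary illustration):

# how many genes survive at each lambda on the cross-validation path
path <- data.frame(lambda  = cv.lassoModel$lambda,
                   nonzero = cv.lassoModel$nzero,
                   cvm     = cv.lassoModel$cvm)

# the smallest lambda that still selects at most 5 genes
lambda_strong <- min(path$lambda[path$nonzero <= 5])

co_strong <- coef(cv.lassoModel, s = lambda_strong)
setdiff(rownames(co_strong)[as.vector(co_strong[, 1]) != 0], "(Intercept)")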

Georges:

Thanks LChart. I tuned lambda to higher values and got fewer parameters (genes). However, including more than 4-5 genes in the glm produces the same problem (very high Std. Error, as you pointed out). Please see the comment above with the screenshot.

LChart:

I'm not deeply familiar with the underlying test for the LASSO, but my strong suspicion is that after the top handful of genes there is a large pool of "equivalent" candidates whose inclusion results in very similar values of the objective function; at that point there can be no certainty about the best model (or the features it contains).
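One way to see that instability directly is to re-run the cross-validated lasso over several random fold splits and count how often each gene is selected. A crude sketch, assuming the DATA_Model_lasso and DATA_Model objects from the question:

# selection frequency across repeated cv.glmnet runs (a rough stability check)
set.seed(1)
selections <- lapply(seq_len(20), function(i) {
  cv <- cv.glmnet(x = data.matrix(DATA_Model_lasso),
                  y = DATA_Model$Treatment,
                  alpha = 1, nfolds = 10, family = "binomial")
  co <- coef(cv, s = "lambda.1se")
  setdiff(rownames(co)[as.vector(co[, 1]) != 0], "(Intercept)")
})
freq <- sort(table(unlist(selections)), decreasing = TRUE)
head(freq, 10)  # genes selected in most runs are the more stable candidates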

