I have RNA-seq data with multiple variables. Using DESeq2 I wasn't able to identify DEGs by the variable of interest (Treatment). The PCA revealed the gene expression seems to be affected by two factor variables (Sex and Tissue).
Thereofre I decided to build lasso-penalised model followed by glm regression analysis (Treatment ~ 'all genes' + Sex + Tissue) using the best predictors from lasso model. The code of lasso and glm
was used from Kevin Blighe post. I'm not entirely sure if the performance of the model is actually compromised by the huge number of genes (40k) since no DEGs were found in the first place. And why all predicted genes have Pr(>|z|)
of 1?. I would like to ask the experts here if they could give me an advice on this analysis if it's at all reasonable to do it this way?
Many thanks
I started from the normalized counts as follows:
library("DESeq2")
library("stats")
dds_raw <- DESeqDataSetFromMatrix(countData = Counts,
colData = Meta_data ,
design = ~ 1)
keep <- rowSums(counts(dds_raw)) >= 10
dds_raw <- dds_raw[keep,]
dds_Diff <- DESeq(dds_raw)
vsd_data <- vst(dds_Diff )
mat <- assay(vsd_data)
mat_df <- as.data.frame(t(mat ))
mat_df$SampleID <- rownames(mat_df)
DATA_Model <- join(mat_df, Meta_data , by= 'SampleID' , type = 'left' , match='all')
##### lasso model
DATA_Model_lasso <- DATA_Model[ , - which(names(DATA_Model) %in% colnames(Meta_data ) )]
lassoModel <- glmnet(x=data.matrix(DATA_Model_lasso ), y= DATA_Model$Treatment, standardize=TRUE,alpha=1, family="binomial")
# perform 10-fold cross validation
cv.lassoModel <- cv.glmnet(x=data.matrix( DATA_Model_lasso ),y= DATA_Model$Treatment, standardize=TRUE,alpha=1, nfolds=10,
family="binomial", # multinomial binomial
parallel=TRUE )
# identify best predictors
idealLambda <- cv.lassoModel$lambda.min
idealLambda1se <- cv.lassoModel$lambda.1se
# derive coefficients for each gene
co <- coef(cv.lassoModel, s=idealLambda, exact=TRUE)
co.se <- coef(cv.lassoModel, s=idealLambda1se, exact=TRUE)
######## identify predictors
predict_nonzero <- predict(cv.lassoModel , type="nonzero" , s= idealLambda1se)
predict_nonzero_matrix <- DATA_Model_lasso[, predict_nonzero$s1 ]
Genes_GLM_ready <- colnames(predict_nonzero_matrix)
## biuld glm of the best predictors, using Sex and Tissue as additional variables
formula_full_lasso <-paste('DATA_Model$Treatment', '~' , paste(Genes_GLM_ready ,
collapse = '+'), '+' , 'Sex' , '+' , 'Tissue' )
finalLasso <- glm(formula_full_lasso, data=DATA_Model,family=binomial(link="logit"))
summary(finalLasso)
Output:
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 9.2139e+01 on 67 degrees of freedom
Residual deviance: 3.9451e-10 on 0 degrees of freedom
AIC: 136
Number of Fisher Scoring iterations: 25
I think that you still have too many model parameters (i.e. genes) in your final model, which results in model failing to fit - there may be other issues. However, I would need to look at various things in order to correctly diagnose any problem.
The lasso is used to first greatly reduce the number of model parameters to ~5-10, and then use these in a final model together with covariates.
Thanks Kevin. the lasso yielded 25 genes with
alpha=1
. if I use up to 4 genes out of those 25 in the followingglm
, I get the following results which look just fine.What do you think about running
glm
on each of the 25 genes separately. Does that make sense?I don't know how to share the counts file here to reproduce the problem.