I have raw read counts for 56 patients who have been classified with the Mandard score as responders and non-responders. I performed differential expression analysis with DESeq2 and obtained a list of DEGs between responders and non-responders. How can I build a model from these genes that is able to predict responders and non-responders? In other words, how can I train a model on these genes to stratify patients into responder and non-responder groups?
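A minimal sketch of the kind of model in question, assuming a normalised/transformed expression matrix expr (genes x patients) and a binary response factor derived from the Mandard grouping; the three gene names are just examples from the list below:

# Build a small data frame: one row per patient, outcome plus selected DEGs
df <- data.frame(response = response,
                 t(expr[c("CDK6", "ISG15", "CXCL8"), ]))

# Logistic regression models the probability of being a responder,
# so no pre-specified expression cut-off is needed
fit <- glm(response ~ ., data = df, family = binomial(link = "logit"))
summary(fit)

# Predicted probability of response for each patient
predict(fit, type = "response")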
Gene    Fold Change    Signature
CHGA -1.5652029 TRG 4-5 Signature
IL1B -1.3159235 TRG 4-5 Signature
CXCL8 -1.231194003 TRG 4-5 Signature
ALB -1.2116047 TRG 4-5 Signature
PTGS2 -1.1674647 TRG 4-5 Signature
CSF3 -1.12757675 TRG 4-5 Signature
OSM -1.113489387 TRG 4-5 Signature
IL6 -1.107777024 TRG 4-5 Signature
ISG15 -1.070607929 TRG 4-5 Signature
IFIT1 -1.0214389 TRG 4-5 Signature
CT45_family -1.0097384 TRG 4-5 Signature
CXCL6 -1.003206221 TRG 4-5 Signature
MAGEA1 -1.001533309 TRG 4-5 Signature
S100A7A 1.0501062 TRG 1-2 Signature
KIF5C 1.146357 TRG 1-2 Signature
HLA-DQA2 1.1829847 TRG 1-2 Signature
KRT14 1.338032 TRG 1-2 Signature
MPPED1 1.3464259 TRG 1-2 Signature
CDK6 1.43895365 TRG 1-2 Signature
KLK5 1.470414 TRG 1-2 Signature
ANKRD30A 1.4728634 TRG 1-2 Signature
CALML5 1.4947388 TRG 1-2 Signature
For example, to build a linear regression model I would need a hypothesis such as: if the expression of gene A in a patient is higher than some cut-off, the patient would be graded as a responder; but I do not have any such threshold.
Please look at the mean expression of the top 20 differentially expressed genes in each group. Their expression values seem quite close to each other, so I am not sure how I can report a strong signature of genes that is able to predict responders and non-responders.
> rowMeans(TRG12)
ALB CALML5 CDK6 CHGA CSF3 CXCL6 CXCL8 IL1B IL6 ISG15 KLK5 KRT14
6.234833 7.638252 10.358273 7.116288 7.646012 9.482543 9.047842 9.803639 7.811549 11.651569 9.054484 11.636721
MAGEA1 OSM PTGS2 S100A7A
6.043237 8.052714 9.228111 7.885017
> rowMeans(TRG45)
ALB CALML5 CDK6 CHGA CSF3 CXCL6 CXCL8 IL1B IL6 ISG15 KLK5 KRT14
6.531966 7.148282 9.771275 8.061407 7.976051 9.833902 9.530818 10.199655 8.471593 12.278263 8.762199 10.925208
MAGEA1 OSM PTGS2 S100A7A
6.209878 8.223946 9.703970 7.565160
Thanks a lot, very helpful as always.
It means that the function cannot find those genes (CDK6 and OAZ1) in your data frame, df. Please check the column names of your data frame:

Hi Kevin,
I tested each of my genes independently in a separate model (roughly as sketched below) and kept 5 genes as the best predictors, using the lowest p-value and AIC, from which I built a final model.
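Something along these lines (a sketch only; df is assumed to hold a binary TRG outcome plus one column of normalised expression per gene, and the five gene names in the final model are placeholders):

# One single-gene logistic model per gene
genes <- setdiff(colnames(df), "TRG")
single_fits <- lapply(genes, function(g)
  glm(reformulate(g, response = "TRG"), data = df,
      family = binomial(link = "logit")))

# Hypothetical final model on the genes kept from the single-gene screen
final_model <- glm(TRG ~ CDK6 + ISG15 + IL6 + KLK5 + CHGA,
                   data = df, family = binomial(link = "logit"))
summary(final_model)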
I obtained this ROC curve from the final model.
Then I used your tutorial on the lasso model in this post:
A: How to exclude some of breast cancer subtypes just by looking at gene expressio
from which I obtained 3 significant genes, of which only CDK6 is common between the two models.
I obtained this ROC curve:
My question for you, please, is: how can I know which model's significant predictors are more correct? Also, is it usual that I am obtaining only 3 significant predictors from 23 differentially expressed genes?
To be honest, without your tutorials I would have needed months to get my head around these ideas.
Thank you man
Without rigorous testing (on other datasets and on real-life samples), it is difficult to know which model is best. However, in your results, you can present multiple different models and state the AUC from ROC analysis for each.
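For example, the AUC of each model could be reported with pROC (a sketch; fit1 and fit2 stand for two fitted glm models and df$TRG for the observed outcome):

library(pROC)

# Predicted probabilities from each fitted model
prob1 <- predict(fit1, type = "response")
prob2 <- predict(fit2, type = "response")

# ROC / AUC for each model against the observed outcome
roc1 <- roc(df$TRG, prob1)
roc2 <- roc(df$TRG, prob2)
auc(roc1)
auc(roc2)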
You should also be applying cross-validation to these models (cv.glm()) and then checking the delta (Δ) values. For a 'good' model, the change in Δ should be small; however, if the model is 'bad', then the change in Δ will be large. K here is the number of 'nfolds'. You may want to read more about cross-validation in order to understand this. Generally, the minimum value for K should be 3. When K = number of samples in the dataset, it is called 'leave-one-out' cross-validation.
--------------------------------
Edit: if you used cv.glmnet(), then your lasso model is already cross-validated (this may be why there are only 3 predictors). For your glm() model from stepwise regression, however, you need to cross-validate this with cv.glm(), as I explain above. More here: A: Multinomial elastic net implementation on microarray dataset , error at lognet a
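A sketch of that cv.glm() step (boot provides cv.glm(); final_model is assumed to be the stepwise glm fitted on df, with the outcome coded 0/1):

library(boot)

# Misclassification rate at a 0.5 probability cut-off
cost <- function(obs, pred) mean(abs(obs - pred) > 0.5)

set.seed(1)
cv3  <- cv.glm(df, final_model, cost = cost, K = 3)
cv10 <- cv.glm(df, final_model, cost = cost, K = 10)

# delta[1] is the raw CV estimate, delta[2] the adjusted estimate
cv3$delta
cv10$delta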
I could not thank you enough. When I build my data table as you suggested and set TRG as the dependent variable with the two levels TRG12 and TRG45, I get a result totally different from when I use 1 and 0 as the levels of TRG. I had guessed that, for logistic regression, numeric or integer values like 1 and 0, or YES and NO, would return the same thing. Please correct me if possible.

When you convert to 1 and 0, it could be that these are coded numerically; they must be encoded as factors. Also, the reference level has to be the same in each case.
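For instance, something like this (a sketch; df$TRG is assumed to hold the group labels):

# Outcome as a factor with an explicit reference level (here TRG12)
df$TRG <- factor(df$TRG, levels = c("TRG12", "TRG45"))

# Equivalent 0/1 coding, still as a factor and with the same reference level
df$TRG01 <- factor(ifelse(df$TRG == "TRG12", 0, 1), levels = c(0, 1))

levels(df$TRG)     # "TRG12" "TRG45" -> TRG12 is the reference
levels(df$TRG01)   # "0" "1"         -> 0 is the reference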
Sorry Kevin, as I mentioned, the individual glm fits gave me 5 genes as significant, and I put them into a final model as described above. But the summary is frustrating: it says that only one gene is significant, while all 5 genes had been selected individually as significant :(
A gene does not require a statistically significant p-value in order for it to provide useful information in a regression model. From the information above, I would say that ISG15 shows a 'trend toward statistical significance'. Did the forward, backward, and 'both' stepwise regression procedures produce the same output?
No Kevin, I only put each of the 23 genes individually into a single-gene model and selected the ones returning a p-value < 0.05, which gave me 5 genes. I then put those into a final model, where the summary says only CDK6 is statistically significant.
To be honest, I am not sure how to judge whether a gene should remain in the model as a good predictor or not. I just look for a lower AIC and p-value, and for AIC I am not sure what threshold I should use to decide whether a gene can stay in the model.
Sorry, when you create the model with 23 genes, you must perform stepwise regression on that. Please review the code in my original answer:
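In outline, something like this (a sketch, not necessarily the verbatim code from that answer; full_model is assumed to be the glm fitted on all 23 genes, with df containing only the TRG outcome and those gene columns):

# Stepwise selection on the full 23-gene logistic model, by AIC
full_model <- glm(TRG ~ ., data = df, family = binomial(link = "logit"))
null_model <- glm(TRG ~ 1, data = df, family = binomial(link = "logit"))

step_both     <- step(full_model, direction = "both", trace = FALSE)
step_backward <- step(full_model, direction = "backward", trace = FALSE)
step_forward  <- step(null_model, direction = "forward", trace = FALSE,
                      scope = formula(full_model))

summary(step_both)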
If you are not sure about this procedure, then please consult a statistician.
Thank you very much :)
Hi @Kevin. Thank you very much for all your helpful answers. I built a combined model for my data
I got these warnings:
I searched and came across this link, and I changed the code to this:
The first warning is gone, but the second one still remains. Could you please tell me how I can fix it?
In layman's terms, it means that there are too many parameters in your model. It could be mitigated by increasing your sample n. Are you aiming to reduce the model parameters? Do you need all of these genes?
Thank you @Kevin! Yes I want to reduce the predictors via stepwise regression as you said. I have 66 samples (46 non-relapse and 20 relapse). These 45 genes are the genes that I got from differential gene expression analysis. I reduced the number of genes to 24 genes, but still get the same warning. What do you think I should do?
In that case, you may consider using, instead, a regularised regression, i.e., with the lasso penalty. Have you used that previously?
Many thanks @Kevin for taking the time to address my questions. No, I've never used it before. I would really appreciate it if you could help me with that.
Hey, there is a really great tutorial, here: https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html
I have also posted some material on Biostars, so, if you search, you should find it
Thank you Kevin. I'll go through that. Could you please explain why I got this warning and how the lasso penalty will solve it?
Basically your previous model was likely 'too good' for the data that you have. This issue can occur when you have too many parameters in the model and / or when you have a small dataset.
Regression with the lasso penalty introduces a penalty / weighting for each parameter, which allows it to work with such problematic models as these. I give a very quick overview here: A: How to exclude some of breast cancer subtypes just by looking at gene expressio
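A minimal sketch of a lasso-penalised logistic model with glmnet (x is assumed to be the samples-by-genes matrix of normalised expression values and y the binary relapse outcome):

library(glmnet)

# Cross-validated lasso (alpha = 1) logistic regression
set.seed(1)
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1,
                   nfolds = 10, type.measure = "auc")

plot(cvfit)            # cross-validated AUC across the lambda path
cvfit$lambda.min       # lambda giving the best CV performance
cvfit$lambda.1se       # largest lambda within 1 SE of the best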
I saw your comprehensive post here about how to build a lasso-penalised model. I did it based on your instructions, and I got this coefficient for each gene:
These are the results based on the min λ (lambda) metric. Sorry for the naive question, I am new to this field, but could you please tell me what the meaning of those dots is? In your post you mentioned that "The best predictors will generally be those that have the smallest (possibly zero) coefficient values". You provided two sets of coefficients, "co" and "co_se"; based on which one do we have to select the best predictors? Many thanks!
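Presumably those two objects come from something like this (a guess at the setup; cvfit is a cv.glmnet fit as above):

# Coefficients at the two standard lambda choices; in the printed sparse
# matrix, a dot means that gene's coefficient has been shrunk exactly to
# zero, i.e. the lasso dropped it
co    <- coef(cvfit, s = "lambda.min")
co_se <- coef(cvfit, s = "lambda.1se")
co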
Hi Kevin Blighe
I have one question related to the above answer. After cross-validation, if I don't have any independent dataset on which to test the model's predictions, will the prediction still be valuable and publishable?
Thanks
Should be able to publish it... somewhere. If you're in a reputable group, you can publish even dud [yet sensationalistic] stuff with Nature Publishing Group. In the case where no external dataset is available, one should split the current dataset into training and validation, although the cross validation procedure is essentially doing this for you anyway.
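A sketch of such a split (roughly 70/30, stratification not shown; df is assumed to contain the selected predictor genes plus a binary column named outcome, a placeholder name):

# Hold out a validation set from the available samples (no external cohort)
set.seed(1)
n         <- nrow(df)
train_idx <- sample(seq_len(n), size = round(0.7 * n))

train <- df[train_idx, ]
valid <- df[-train_idx, ]

# Fit on the training set, then assess on the held-out validation set
fit  <- glm(outcome ~ ., data = train, family = binomial(link = "logit"))
prob <- predict(fit, newdata = valid, type = "response")
pROC::auc(pROC::roc(valid$outcome, prob))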