Hi, everyone
I have a gene expression dataset from cancer patients, on which I performed Cox regression for each gene to find its association with overall survival. I came down to a shortlist of around 1,500 candidates. To my mind, if one could show which genes from that shortlist are associated with each other, and then test whether their combination is still associated with overall survival, that would give more meaning.
So, I thought about linear modeling, where all possible combinations of genes should be tested. However, I am a bit stalled with these issues:
- linear models are limited in the number of variables they can handle, which means this test cannot be performed on such a large number of candidates (e.g. with the lm() or glm() function);
- in this case, would a linear model (or a variant of it) be the method of choice? Which other method would you recommend?
- given the experience that a lot of you have here, does my rationale on how to handle this dataset make sense at all?
Any help is much appreciated! Thanks.
Hi again!
This time, going a bit further with the analysis. I have performed Cox regression with a lasso penalty (glmnet()), cross-validated with cv.glmnet(), and came out with a total of 31 candidates. To my understanding, and by plotting the K-M curves, those candidates show the best association with overall survival (OS). Say, these were the candidates:
However, finding the association of those genes among themselves would give more meaning to the results. So, my idea was to run a simple multivariate Cox analysis (coxph()) with those candidates by:
and found that (for illustration, I will just add those candidates with significant p-values):
The results show that there are 8 candidates (p<0.05) whose expression is associated with each other. OK, now we have the information about those candidates (1) that are associated with OS and (2) whose gene expression is associated with each other. Based on the HR ("exp(coef)") values from that table, these results also show, for example, that low expression of "X15605_i_at" (HR = 0.59) (gene A) or high expression of "X2000_at" (HR = 1.53) (gene B) is associated with a worse OS. And so forth.
Now, if I combine gene A AND gene B and artificially classify patients with low expression of A and high expression of B as "high risk" (and the other way around as "low risk"), I obtain a very clear K-M plot showing the difference between those 2 groups: the "high risk" group indeed has a worse OS compared to the "low risk" group.
However, if I combine all 7 candidates to make my 2 groups of patients, there are only 3 patients (out of 196) that fulfill the criteria.
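To make the grouping rule above concrete, here is a minimal sketch in plain Python (standard library only; the same logic ports directly to R). Everything in it is an assumption for illustration: the function name `classify`, the median as the high/low cutoff, and the made-up toy expression values. It only shows why the number of patients that satisfy every rule shrinks as more genes are added.

```python
from statistics import median

def classify(expr, rules):
    """Label each patient from a set of per-gene rules.

    expr:  dict gene -> list of expression values (one per patient)
    rules: dict gene -> "high" or "low", the direction linked to worse OS
    Returns "high risk" if every rule is met, "low risk" if every rule
    is reversed, and None for patients with a mixed pattern.
    """
    n_patients = len(next(iter(expr.values())))
    # Assumption: split each gene at its median; values equal to the
    # median count as "low".
    cuts = {g: median(expr[g]) for g in rules}
    labels = []
    for i in range(n_patients):
        hits = [(expr[g][i] > cuts[g]) == (direction == "high")
                for g, direction in rules.items()]
        if all(hits):
            labels.append("high risk")
        elif not any(hits):
            labels.append("low risk")
        else:
            labels.append(None)
    return labels

# Made-up toy data: 8 patients, 3 hypothetical genes.
toy_expr = {
    "A": [1, 2, 3, 4, 5, 6, 7, 8],
    "B": [8, 7, 6, 5, 1, 2, 3, 4],
    "C": [1, 8, 2, 7, 3, 6, 4, 5],
}
two_genes = classify(toy_expr, {"A": "low", "B": "high"})
three_genes = classify(toy_expr, {"A": "low", "B": "high", "C": "high"})
print(two_genes.count(None), three_genes.count(None))
```

With two rules every toy patient lands in one of the two groups; adding a third rule leaves half of them unclassified, which is the same effect as the 3-out-of-196 situation, only on a smaller scale.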
Finally, my question is: is there a way to do some sort of permutation/combination analysis coupled with Cox regression to find the combination of targets that best associates with OS, considering that this combination of factors should be represented in, say, at least 30% of my samples?
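One way such a search could look is sketched below in plain Python, again only as an illustration and not as a recommended analysis. The assumptions are all mine: a two-group log-rank statistic stands in for the Cox model, the "high risk" group is every patient showing the adverse pattern for all genes in the combination (everyone else is the comparison group), the cutoff is the median, and the names `logrank_chi2` and `search_combinations` and the toy data are made up.

```python
from itertools import combinations
from statistics import median

def logrank_chi2(time, event, group):
    """Two-group log-rank chi-square statistic (1 d.f.).
    time: follow-up times; event: 1 = death, 0 = censored;
    group: True for the candidate high-risk group."""
    obs_minus_exp, var = 0.0, 0.0
    for t in sorted({ti for ti, ei in zip(time, event) if ei}):
        n = sum(1 for ti in time if ti >= t)                        # at risk
        n1 = sum(1 for ti, gi in zip(time, group) if ti >= t and gi)
        d = sum(1 for ti, ei in zip(time, event) if ti == t and ei)  # deaths
        d1 = sum(1 for ti, ei, gi in zip(time, event, group)
                 if ti == t and ei and gi)
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return obs_minus_exp ** 2 / var if var > 0 else 0.0

def search_combinations(expr, rules, time, event, max_size=3, min_fraction=0.30):
    """Score every gene combination up to max_size whose high-risk group
    covers at least min_fraction of patients; best score first."""
    n = len(time)
    cuts = {g: median(expr[g]) for g in rules}
    adverse = {g: [(v > cuts[g]) == (rules[g] == "high") for v in expr[g]]
               for g in rules}
    results = []
    for k in range(1, max_size + 1):
        for combo in combinations(sorted(rules), k):
            group = [all(adverse[g][i] for g in combo) for i in range(n)]
            fraction = sum(group) / n
            if fraction < min_fraction or fraction == 1.0:
                continue  # too few patients, or no comparison group left
            results.append((logrank_chi2(time, event, group), combo, fraction))
    return sorted(results, reverse=True)

# Made-up toy data: 10 patients, all with an observed event.
toy_time = [2, 3, 4, 5, 6, 10, 11, 12, 13, 14]
toy_event = [1] * 10
toy_expr = {"A": [1, 1, 1, 1, 1, 9, 9, 9, 9, 9],
            "B": [1, 9, 1, 9, 1, 9, 1, 9, 1, 9]}
toy_rules = {"A": "low", "B": "high"}
ranked = search_combinations(toy_expr, toy_rules, toy_time, toy_event)
for chi2, combo, fraction in ranked:
    print(combo, round(fraction, 2), round(chi2, 2))
```

Picking the best of many combinations inflates the apparent association, so the top score would still need to be checked, for example by repeating the whole search on permuted survival labels or on held-out patients.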
Sorry for the long post. That was the best way I could find to explain it. And as always, any light shed here will be greatly appreciated.
Thanks!
Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. SUBMIT ANSWER is for new answers to the original question.
Thank you, genomax. The reason I posted here is that this is a follow-up to the original question in this post. But you are right. Thank you for the info.