I have a set of biomarkers (genes) and I'd like to validate it by analysing Kaplan-Meier survival curves. I had a look at the literature, and most of the time the curve is calculated using the expression value of a single gene. The patients are divided into two groups, low expression and high expression, often using an arbitrary expression threshold such as the mean. A curve is then plotted for each of the two groups. What is common practice when there is more than one gene? How are the patients assigned to the two groups?
Furthermore, is there a simple tool that allows me, given:
- list of genes
- expression values of those genes
- survival information of the patients
to create/plot the Kaplan Meier curve?
Thanks
I would suggest taking your gene set and doing an unsupervised clustering of your patients based on their gene expression profiles (e.g. K-means with k=2 or k=3). Then plot a separate Kaplan-Meier curve for each cluster and compare them with a standard test such as the log-rank test. I guess that would be easy to do with a custom script in R.
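If Python is more convenient than R, the idea can be sketched with the standard library alone. This is only a minimal illustration, not a validated pipeline: the expression values and survival times are made-up toy data, the function names (`kmeans`, `kaplan_meier`, `logrank_chi2`) are my own, and starting k-means from the first k patients is a simplification. For real data a dedicated survival package (e.g. `survival` in R or `lifelines` in Python) is the safer choice.

```python
import math

def kmeans(rows, k=2, iters=100):
    """Plain Lloyd's k-means; returns one cluster label per patient."""
    # Assumption: first k rows as starting centroids (random restarts are more usual).
    centroids = [list(r) for r in rows[:k]]
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(r, centroids[c])))
            for r in rows
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [r for r, l in zip(rows, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (event 1 = death, 0 = censored)."""
    surv, curve = 1.0, []
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        surv *= 1 - deaths / at_risk
        curve.append((t, surv))
    return curve

def logrank_chi2(times, events, groups):
    """Two-group log-rank statistic (chi-square with 1 degree of freedom)."""
    o_minus_e, var = 0.0, 0.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        n = sum(1 for u in times if u >= t)                              # at risk, both groups
        n1 = sum(1 for u, g in zip(times, groups) if u >= t and g == 0)  # at risk, group 0
        d = sum(1 for u, e in zip(times, events) if u == t and e)        # deaths at t
        d1 = sum(1 for u, e, g in zip(times, events, groups) if u == t and e and g == 0)
        o_minus_e += d1 - d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return o_minus_e ** 2 / var if var > 0 else 0.0

# Toy data (invented): 6 patients x 2 genes, follow-up time, event flag.
expr = [[1.0, 1.2], [5.0, 5.5], [0.8, 1.0], [5.2, 4.9], [1.1, 0.9], [4.8, 5.1]]
times = [5, 20, 6, 25, 8, 30]
events = [1, 1, 1, 0, 1, 1]

labels = kmeans(expr, k=2)

def subset(values, cluster):
    return [v for v, l in zip(values, labels) if l == cluster]

curve_a = kaplan_meier(subset(times, 0), subset(events, 0))
curve_b = kaplan_meier(subset(times, 1), subset(events, 1))
chi2 = logrank_chi2(times, events, labels)
p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square, 1 d.o.f.

print("cluster 0 curve:", curve_a)
print("cluster 1 curve:", curve_b)
print("log-rank chi2:", chi2, "p:", p_value)
```

Each curve is a list of (time, survival probability) steps that can be passed to any plotting library as a step function. With k=3 the same clustering and curves apply, but the two-group log-rank function above would need to be generalised.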
But what would be the meaning of each group? Suppose k=2 and there is a significant difference between the two KM curves. What can I conclude? What is the value of such a division into groups?
Any grouping of this kind that can predict clinical outcome is of high value in itself. The interpretation should be based on the up/down states of the genes in the samples of each group. I know that is quite a heuristic approach. Yet in breast cancer classification, for example, one has ER+, triple negative, etc.: multiple subtypes based on combinations of expression of 3 receptor genes, and it works fine.
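To make the "up/down states" concrete, one rough way is to compare each cluster's mean expression per gene with the mean over all patients. The sketch below assumes cluster labels are already available; `describe_clusters`, the gene names and the numbers are all hypothetical, and a simple above/below-the-mean call is only a first look, not a statistical test.

```python
def describe_clusters(expr, labels, genes):
    """For each cluster, call every gene 'up' or 'down' relative to the cohort-wide mean."""
    overall = [sum(col) / len(col) for col in zip(*expr)]
    summary = {}
    for c in sorted(set(labels)):
        members = [row for row, l in zip(expr, labels) if l == c]
        means = [sum(col) / len(col) for col in zip(*members)]
        summary[c] = {
            g: ("up" if m > o else "down") for g, m, o in zip(genes, means, overall)
        }
    return summary

# Toy data (invented): 4 patients x 2 genes, with cluster labels from a previous step.
genes = ["GENE_A", "GENE_B"]
expr = [[1.0, 5.0], [5.0, 1.2], [0.8, 5.5], [5.2, 0.9]]
labels = [0, 1, 0, 1]

summary = describe_clusters(expr, labels, genes)
for cluster, states in summary.items():
    print(cluster, states)
```

A cluster with worse survival and, say, GENE_A down and GENE_B up can then be described in those terms, much like the receptor-based subtypes.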
You may want to check out this website, which lets you easily draw KM plots (based on their datasets).