Hi,
I have the expression of one gene for 273 glioma patients, as well as their clinical data. I want to do a survival analysis and generate a Kaplan-Meier plot of the patients' survival based on the expression of the gene: "high" or "low". I saw this tutorial on Biostars (Tutorial: Survival analysis with gene expression), and the author takes the Z-score of the expression data to stratify expression as high or low. However, the Z-score is based on the expression of all genes per patient (i.e. taking the average expression and standard deviation of all genes for every patient). Since I don't have the expression of other genes, is it appropriate to take the Z-score for the expression of this gene across all patients (i.e. use the average expression and standard deviation of this gene for all the patients) and stratify high or low expression based on that? Or does survival analysis with gene expression have to be based on the expression of genes per patient, rather than one gene across all patients? I hope this makes sense, please let me know if I need to clarify more.
Thank you!
Hi Kevin, thank you for responding and for your suggestion (and for writing such a great tutorial). Doing quartiles or something similar is easier indeed, but I want to make sure it's appropriate; I will be dividing into quartiles for the expression of this one gene from all the samples, i.e. 'high' will mean high expression relative to other patients, rather than relative to other genes. Is this a conventional way to do survival analysis?
There is no right or wrong way, really. In my tutorial, I first transform the expression data to Z-scores by row (gene), and then perform the 1st pass analysis using the gene Z-scores on the continuous scale. I then identify key genes from this 1st pass and put those into a new Cox model, but encoded this time as low|mid|high. So, indeed, a gene with a high Z-score has high expression relative to all other genes.

In your case, using quartiles, you can just refer to upper-, mid-, and lower- quartiles, and avoid the use of the word 'high' or 'low', if that helps. Indeed, it would not be high relative to the other genes (well, it may be, but we don't know).
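As a rough illustration of the Z-score encoding described above, here is a minimal Python sketch for a single gene measured across patients. The cutoff of 1, the function names, and the expression values are all assumptions for illustration; they are not taken from the tutorial.

```python
from statistics import mean, stdev

def zscores(values):
    """Z-score one gene's expression across patients (uses the sample SD)."""
    mu = mean(values)
    sd = stdev(values)
    return [(v - mu) / sd for v in values]

def encode(z, cutoff=1.0):
    """Label a Z-score as low/mid/high; the cutoff is an assumed threshold."""
    if z >= cutoff:
        return "high"
    if z <= -cutoff:
        return "low"
    return "mid"

# Hypothetical expression values for five patients
expression = [2.0, 4.0, 6.0, 8.0, 10.0]
z = zscores(expression)
labels = [encode(s) for s in z]
```

The resulting labels could then be used as a categorical covariate in a survival model.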
You can easily convert a vector into quartiles like this:
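For example, a minimal Python sketch (the language, the function name, and the expression values are assumptions for illustration, not the original snippet):

```python
from bisect import bisect_left
from statistics import quantiles

def quartile_groups(values):
    """Assign each value to quartile 1-4 using cut points from the data itself."""
    # Three cut points (25th, 50th, 75th percentiles); "inclusive" treats the
    # data as the whole population of interest, min and max included.
    cuts = quantiles(values, n=4, method="inclusive")
    # Count how many cut points lie strictly below the value, so a value equal
    # to a cut point falls into the lower group.
    return [bisect_left(cuts, v) + 1 for v in values]

# Hypothetical expression values of one gene for eight patients
expression = [5.1, 2.3, 9.8, 7.4, 3.3, 8.0, 6.2, 4.4]
groups = quartile_groups(expression)
```

Here group 1 is the lower quartile and group 4 the upper quartile, each defined relative to the other patients for this one gene.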
Thank you, this is really helpful!