I have gene expression data and mortality labels for a group of patients. I performed a differential gene expression analysis and selected the top 40 most differentially expressed genes (ranked by adjusted p-value). I then extracted the expression data for these 40 genes, ran a PCA on it, and clustered the PCA results with k-means in R. I get two distinct clusters. The details are below:
Dataset description
Total patients: 275
Number of dead patients: 52
Number alive: 223
Clustering results on the PCA-reduced dataset for the top 40 most differentially expressed genes
Cluster 1: Alive 136, Dead 1
Cluster 2: Alive 87, Dead 51
So cluster 1 is enriched in patients who are alive, which seems like a good result to me. What else can I say from this analysis? One conclusion is that the expression of the top 40 genes can differentiate between alive and dead patients in this dataset. But how well does it differentiate? Is there a metric I can attach to these results?
How many dimensions of the PCA are you using for clustering?
I am using the first three dimensions of the PCA.
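One way to attach numbers to the cluster-by-outcome counts given above is to treat them as a 2x2 contingency table, with membership in cluster 2 read as a "predicted dead" call. The sketch below is in Python rather than the R used for the analysis, and it uses only the standard library. The function name `summarize` and the choice of a one-sided Fisher's exact test are assumptions made for illustration, not part of the original analysis.

```python
from math import comb

# Counts copied from the question: one pair per cluster.
cluster1_alive, cluster1_dead = 136, 1
cluster2_alive, cluster2_dead = 87, 51


def summarize(a1, d1, a2, d2):
    """Summarize a 2x2 table, reading cluster 2 as a 'predicted dead' call.

    a1, d1: alive and dead counts in cluster 1
    a2, d2: alive and dead counts in cluster 2
    """
    total = a1 + d1 + a2 + d2
    dead_total = d1 + d2
    cluster2_size = a2 + d2

    sensitivity = d2 / dead_total       # dead patients that land in cluster 2
    specificity = a1 / (a1 + a2)        # alive patients that land in cluster 1
    ppv = d2 / cluster2_size            # fraction of cluster 2 that is dead
    npv = a1 / (a1 + d1)                # fraction of cluster 1 that is alive
    accuracy = (a1 + d2) / total
    balanced_accuracy = (sensitivity + specificity) / 2

    # Sample odds ratio; undefined if a2 or d1 is zero (not the case here).
    odds_ratio = (d2 * a1) / (a2 * d1)

    # One-sided Fisher's exact test: probability of seeing at least d2 dead
    # patients in cluster 2 if cluster membership were unrelated to outcome
    # (upper tail of the hypergeometric distribution).
    upper = min(dead_total, cluster2_size)
    tail = sum(
        comb(dead_total, k) * comb(total - dead_total, cluster2_size - k)
        for k in range(d2, upper + 1)
    )
    p_value = tail / comb(total, cluster2_size)

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
        "odds_ratio": odds_ratio,
        "fisher_p_one_sided": p_value,
    }


metrics = summarize(cluster1_alive, cluster1_dead, cluster2_alive, cluster2_dead)
for name, value in metrics.items():
    print(f"{name}: {value:.4g}")
```

Because only 52 of the 275 patients died, plain accuracy is not very informative here, so sensitivity, specificity, and balanced accuracy are reported alongside it. These values describe the same patients whose mortality labels were used to pick the 40 genes, so they are likely optimistic compared with what would be seen on held-out patients.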