Question

PCA-based clustering gene expression data


6.1 years ago

Gene_MMP8 ▴ 240

I have gene expression data and mortality labels for a group of patients. I performed differential gene expression analysis and selected the top 40 most differentially expressed genes (ranked by adjusted p-value). I then extracted the expression data for these 40 genes and ran a PCA on it. The PCA results were clustered using kmeans in R, which gives two distinct clusters. The details are below:
Dataset description
Total patients: 275
Dead: 52
Alive: 223

Clustering results on the PCA-reduced dataset for the top 40 most differentially expressed genes
Cluster 1: 136 alive, 1 dead
Cluster 2: 87 alive, 51 dead

So cluster 1 is enriched for patients who are alive, which looks good to me. What else can I say from this analysis? One conclusion is that the expression of the top 40 genes can differentiate between alive and dead patients in this dataset. But how well does it differentiate? Is there a metric I can attach to these results?

RNA-Seq R • 1.3k views

updated 6.1 years ago by Kevin Blighe 89k • written 6.1 years ago by Gene_MMP8 ▴ 240


How many dimensions of the PCA are you using for clustering?

6.1 years ago by Asaf 10k


I am using the first three dimensions (principal components) of the PCA.

6.1 years ago by Gene_MMP8 ▴ 240
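
For readers following along, the workflow described in the question and comments (PCA on the 40 genes, then k-means on the first three components) might look roughly like the sketch below. The original data and code were not posted, so the expression matrix here is simulated and every object name (`expr`, `status`, `pca`, `km`) is hypothetical; only the sample sizes, the number of genes, and the use of three components come from the thread.

```r
# Simulated stand-in data (assumption): 275 patients x 40 genes.
set.seed(1)
status <- factor(rep(c("alive", "dead"), times = c(223, 52)))
expr <- matrix(rnorm(275 * 40), nrow = 275,
               dimnames = list(NULL, paste0("gene", 1:40)))
# Shift ten genes in the dead group to mimic differential expression.
expr[status == "dead", 1:10] <- expr[status == "dead", 1:10] + 1.5

# PCA on centred, scaled expression values (samples in rows, genes in columns).
pca <- prcomp(expr, center = TRUE, scale. = TRUE)

# k-means with two clusters on the first three principal components.
km <- kmeans(pca$x[, 1:3], centers = 2, nstart = 25)

# Cross-tabulate cluster membership against mortality status.
tab <- table(cluster = km$cluster, status = status)
print(tab)

# Fraction of total variance carried by the three components that were used.
var_explained <- sum(pca$sdev[1:3]^2) / sum(pca$sdev^2)
print(var_explained)
```

Reporting `var_explained` alongside the cluster table makes it clear how much of the 40-gene signal the three components actually retain.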

score 5 · Accepted Answer · 2019-07-16

You should qualify your statements and results with:

A chi-squared test to check the relative proportions of alive | dead patients in each cluster.
A multivariate binary logistic regression model with all 40 genes as predictors and alive | dead as the end-point. This would be followed by cross-validation of the model and then ROC analysis. Before cross-validation and ROC analysis, you could aim to reduce the model parameters via stepwise regression or by manually inspecting the errors, residuals, p-values, etc.
Hierarchical clustering using just the 40 genes, to gauge the separation of samples.
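
The three suggestions above could be sketched in base R along the following lines. This is an illustration under stated assumptions, not the analysis itself: the chi-squared step uses the cluster counts reported in the question, but the expression matrix for the regression and hierarchical clustering steps is simulated, and the names `counts`, `expr`, `dat`, `folds`, and `auc` are hypothetical. The 5-fold split and Ward linkage are arbitrary choices made for the example.

```r
# 1. Chi-squared test on the cluster-by-outcome counts reported in the question.
counts <- matrix(c(136, 1,
                   87, 51),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(cluster = c("1", "2"),
                                 status = c("alive", "dead")))
chi <- chisq.test(counts)   # continuity correction is applied by default for 2x2 tables
fis <- fisher.test(counts)  # exact alternative, handy when a cell count is small
print(chi$p.value)
print(fis$p.value)

# 2. Logistic regression with cross-validation and ROC AUC (simulated data, assumption).
set.seed(1)
status <- rep(c(0, 1), times = c(223, 52))   # 0 = alive, 1 = dead
expr <- matrix(rnorm(275 * 40), nrow = 275,
               dimnames = list(NULL, paste0("gene", 1:40)))
expr[status == 1, 1:5] <- expr[status == 1, 1:5] + 0.8
dat <- data.frame(status = status, expr)

# 5-fold CV: each patient is scored by a model fitted without that patient.
folds <- sample(rep(1:5, length.out = nrow(dat)))
pred <- numeric(nrow(dat))
for (k in 1:5) {
  held_out <- folds == k
  fit <- glm(status ~ ., data = dat[!held_out, ], family = binomial)
  pred[held_out] <- predict(fit, newdata = dat[held_out, ], type = "response")
}

# AUC via the rank-sum (Mann-Whitney) identity, so no extra package is needed.
r <- rank(pred)
n1 <- sum(dat$status == 1)
n0 <- sum(dat$status == 0)
auc <- (sum(r[dat$status == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
print(auc)

# 3. Hierarchical clustering on the 40 genes, cut into two groups.
hc <- hclust(dist(scale(expr)), method = "ward.D2")
hc_groups <- cutree(hc, k = 2)
print(table(cluster = hc_groups, status = status))
```

With roughly 50 events and 40 predictors the full model is prone to overfitting, which is why reducing the parameters first (for example with `step()` on the fitted `glm`) is worth doing before trusting the cross-validated AUC.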

Kevin