PCA on the most variable genes didn't reveal any grouping of the samples (they all fall into one cluster). Therefore, I am looking for alternative ways to derive a grouping of the samples. I am not sure what the best approach is for each of the three steps below.
1. Perform unsupervised clustering on bulk RNAseq data to derive molecular subtypes
- Correct for library size and apply a variance-stabilizing transformation (DESeq2)
- Gene selection (e.g. by variance, unimodality test)
- Apply a k-means/hierarchical clustering algorithm to the distance matrix
- Decide on the best number of clusters using e.g. a sum of squared errors (SSE) scree plot and/or correlation with clinical variables
==> What other preprocessing steps are recommended for clustering? E.g. Z score, quantile normalization?
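For concreteness, here is a minimal sketch of the step 1 pipeline after the DESeq2 transformation, written in Python with NumPy only. It is not a recommendation of specific settings: the matrix `vst` is synthetic stand-in data, the function names are made up for illustration, and the choices of 100 genes, per-gene z-scoring and plain Lloyd-style k-means are assumptions. Z-scoring gives every selected gene equal weight in the Euclidean distance, which may or may not be what you want.

```python
import numpy as np

def select_top_variable_genes(expr, n_genes):
    """Return column indices of the n_genes most variable genes (expr: samples x genes)."""
    variances = expr.var(axis=0, ddof=1)
    return np.argsort(variances)[::-1][:n_genes]

def zscore_genes(expr):
    """Center and scale each gene (column) to mean 0 and standard deviation 1."""
    sd = expr.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0  # leave constant genes unscaled instead of dividing by zero
    return (expr - expr.mean(axis=0)) / sd

def kmeans(X, k, n_init=10, n_iter=100, seed=0):
    """Plain Lloyd k-means with random restarts; returns (labels, SSE) of the best run."""
    rng = np.random.default_rng(seed)
    best_labels, best_sse = None, np.inf
    for _ in range(n_init):
        # Initialize centers from k distinct samples
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            # Recompute each center; keep the old one if a cluster became empty
            new_centers = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        sse = dists[np.arange(len(X)), labels].sum()
        if sse < best_sse:
            best_labels, best_sse = labels, sse
    return best_labels, best_sse

def sse_curve(X, k_values, seed=0):
    """SSE for each candidate number of clusters (the values behind a scree plot)."""
    return {k: kmeans(X, k, seed=seed)[1] for k in k_values}

# Synthetic stand-in for a VST matrix: 200 samples x 500 genes,
# with a hypothetical subtype signal in the first 40 genes of half the samples.
rng = np.random.default_rng(1)
vst = rng.normal(8.0, 0.3, size=(200, 500))
vst[:100, :40] += 2.0

top = select_top_variable_genes(vst, n_genes=100)
X = zscore_genes(vst[:, top])
sse = sse_curve(X, k_values=range(1, 7))
for k, value in sse.items():
    print(k, round(value, 1))
```

A bend in the printed SSE values would suggest a candidate number of clusters. On real data the curve is often much less clear than on this synthetic example, so it is worth checking it against the clinical variables as well.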
2. Extract a gene signature that describes each of the clusters
Look for significant gene expression differences between clusters using the likelihood ratio test (DESeq2), and manually select genes based on a heatmap ==> Is there a better/easier way to do this?
3. Classify a second, independent bulk RNAseq dataset (different sequencing protocol) using the gene signature
Cluster on the genes in the gene signature, using the number of clusters and preprocessing steps from step 1, and manually assign cluster names based on a heatmap ==> Is there a better/easier way to do this?
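One possible alternative to re-clustering and naming clusters by eye is a nearest-centroid assignment, sketched below in Python with NumPy. This is a different technique from the heatmap-based procedure described above, not an implementation of it. It assumes that both datasets are restricted to the same signature genes in the same order, that each dataset is z-scored per gene on its own to absorb protocol differences, and that the second cohort contains the subtypes in roughly similar proportions (otherwise per-dataset z-scoring can distort the assignment). All data and names here are synthetic and hypothetical.

```python
import numpy as np

def zscore_genes(expr):
    """Center and scale each gene (column) within one dataset."""
    sd = expr.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0  # avoid division by zero for constant genes
    return (expr - expr.mean(axis=0)) / sd

def cluster_centroids(z_train, labels):
    """Mean signature profile per cluster in the first (training) dataset."""
    classes = np.unique(labels)
    centroids = np.vstack([z_train[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def assign_by_correlation(z_new, classes, centroids):
    """Assign each new sample to the centroid with the highest Pearson correlation."""
    a = z_new - z_new.mean(axis=1, keepdims=True)
    b = centroids - centroids.mean(axis=1, keepdims=True)
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    b /= np.linalg.norm(b, axis=1, keepdims=True)
    corr = a @ b.T  # samples x clusters correlation matrix
    return classes[corr.argmax(axis=1)], corr

# Synthetic example: 50 signature genes, two hypothetical subtypes "A" and "B".
rng = np.random.default_rng(0)
names = np.array(["A", "B"])
pattern = np.zeros((2, 50))
pattern[0, :25] = 2.0   # genes up in subtype A
pattern[1, 25:] = 2.0   # genes up in subtype B

train_idx = np.repeat([0, 1], 50)
train = pattern[train_idx] + rng.normal(size=(100, 50))
train_labels = names[train_idx]

# Second dataset on a different scale and offset, mimicking another protocol.
new_idx = np.repeat([0, 1], 30)
new = 1.5 * (pattern[new_idx] + rng.normal(size=(60, 50))) + 5.0
new_truth = names[new_idx]

classes, centroids = cluster_centroids(zscore_genes(train), train_labels)
predicted, corr = assign_by_correlation(zscore_genes(new), classes, centroids)
accuracy = (predicted == new_truth).mean()
print("agreement with simulated truth:", accuracy)
```

The correlation matrix `corr` is also useful on its own: samples whose best correlation is low, or nearly tied between two centroids, can be flagged as unassigned rather than forced into a subtype.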
What sample sizes are you analyzing? The approach will depend on the scale of the study.
PCA didn't reveal any clustering.
Roughly 200 samples