I have 25 samples from TCGA, which contain RNA sequencing expression data from 25 different clinical cancer tumor biopsies. I want to cluster them based on similar expression. The problem is that there are no conditions or replicates from which to build an experimental design to feed into DESeq or edgeR. I also tried things like Perl's KMeans library, R's built-in kmeans(), etc. The problem is that for each sample I have 25K expression values (25K genes), so the feature vectors are very large and I don't think I am getting anything useful.
Does anyone have any advice on clustering data sets with very large feature vectors and/or clustering expression data without biological conditions?
Thanks,
Kyle
Thank you for your response. I have two follow up questions.
1. Should I be working with the RPKM values or with counts? (I am using counts, since I initially built a count table for DESeq.)
2. Should I just use common sense for the cutoffs (what counts as low-count, what counts as low-variance), or are there standards that people use?
Otherwise, I am paring down the gene list and moving forward with your suggestions. I feel a lot more sane working with a couple thousand genes.
Thanks again!
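
For illustration, the paring-down step mentioned above (drop low-count genes, then keep the most variable ones) could be sketched roughly as follows in plain Python. This is only a sketch under stated assumptions: the function name `filter_genes`, the counts-per-million scaling, and the `min_cpm`, `min_samples`, and `top_n` values are all hypothetical choices made for the example, not established standards, and the toy count table is made up. It also assumes every sample has a nonzero total count.

```python
import math
from statistics import pvariance


def filter_genes(counts, min_cpm=1.0, min_samples=3, top_n=2000):
    """Pare down a raw count table before clustering (hypothetical helper).

    counts: dict mapping gene name -> list of raw counts, one per sample.
    All thresholds are illustrative assumptions, not recommended defaults.
    Returns a dict of log2(CPM + 1) values for the retained genes,
    ordered from most to least variable.
    """
    genes = list(counts)
    n_samples = len(counts[genes[0]])

    # Library size (total counts) per sample, used to scale to counts per million.
    lib_sizes = [sum(counts[g][j] for g in genes) for j in range(n_samples)]

    log_cpm = {}
    for g in genes:
        cpm = [counts[g][j] * 1e6 / lib_sizes[j] for j in range(n_samples)]
        # Low-count filter: keep genes reaching min_cpm in at least min_samples samples.
        if sum(c >= min_cpm for c in cpm) >= min_samples:
            log_cpm[g] = [math.log2(c + 1.0) for c in cpm]

    # Low-variance filter: rank genes by variance of log-CPM and keep the top_n.
    ranked = sorted(log_cpm, key=lambda g: pvariance(log_cpm[g]), reverse=True)
    return {g: log_cpm[g] for g in ranked[:top_n]}


# Made-up toy table: 4 genes x 4 samples.
toy_counts = {
    "geneA": [100, 200, 150, 400],
    "geneB": [0, 0, 1, 0],        # almost never detected
    "geneC": [500, 500, 500, 500],
    "geneD": [10, 900, 20, 800],  # strongly variable across samples
}

kept = filter_genes(toy_counts, min_cpm=1.0, min_samples=2, top_n=2)
print(list(kept))
```

The retained log-scale values, rather than all 25K raw counts, would then be what gets passed to whatever clustering routine is used. On a real data set the thresholds would need to be chosen by looking at the distribution of counts and variances.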