Question

Clustering RNA Seq - No Conditions, No Replicates

9.7 years ago

mbio.kyle ▴ 380

I have 25 samples from the TCGA, containing RNA sequencing expression data from 25 different clinical cancer tumor biopsies. I want to cluster them based on similar expression. The problem is that there are no conditions or replicates from which to build an experimental design to feed into DESeq or edgeR. I also tried things like Perl's KMeans library, R's built-in kmeans(), etc. The trouble there is that each sample has 25K expression values (25K genes), so the feature vectors are very large, and I don't think I am getting anything useful.

Does anyone have any advice on clustering data sets with very large feature vectors, and/or on clustering expression data without biological conditions?

Thanks,

Kyle

RNA RNA-Seq R • 3.6k views

updated 2.5 years ago by Ram 44k • written 9.7 years ago by mbio.kyle ▴ 380

Ram · Accepted Answer · 2015-03-22

I'd suggest first having a hypothesis. Given what you know about the TCGA biopsies and where they came from, what do you expect a clustering to look like?

To get a sense of the data, first make sure the counts are suitably normalized. You could then filter the genes down to the subset with the highest variance across samples on the log scale and start clustering with a small set of those genes, perhaps a few hundred or a thousand, something that is easy to visualize. I would start with correlation as the similarity measure.

Do you know if the samples came from different clinical sites? Are those sites reflected in the initial clusters? Or the biopsy tissue source? Once you have an initial picture based on a subset of the most variable genes, depending on what you find, you may wish to expand out to include more of the genes, to see what new clusters emerge, if any. Be aware that low-count mRNAs may contribute more noise than signal to your clustering.
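For what it's worth, the steps above (normalize, keep the most variable genes on the log scale, cluster samples by correlation) might be sketched roughly as follows. This is only an illustration under several assumptions that are not from the thread: log2(CPM + 1) stands in for "suitably normalized", average linkage is picked as the agglomeration rule, and the six-sample count table is made-up toy data, not TCGA. It uses only the Python standard library so it runs as-is; on real data you would more likely use R (for example `hclust` on `as.dist(1 - cor(x))`) or NumPy/SciPy.

```python
import math
import random
from statistics import mean, pvariance


def log_cpm(counts):
    """log2(counts-per-million + 1) per sample; one simple normalization (an assumption)."""
    out = {}
    for sample, row in counts.items():
        lib_size = sum(row)
        out[sample] = [math.log2(c / lib_size * 1e6 + 1.0) for c in row]
    return out


def top_variable_genes(expr, n_top):
    """Indices of the n_top genes with the highest variance across samples."""
    samples = list(expr)
    n_genes = len(expr[samples[0]])
    variances = [pvariance([expr[s][g] for s in samples]) for g in range(n_genes)]
    return sorted(range(n_genes), key=lambda g: variances[g], reverse=True)[:n_top]


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumes neither is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_linkage(expr, genes, k):
    """Agglomerative clustering of samples with distance 1 - r, stopped at k clusters."""
    samples = list(expr)
    profile = {s: [expr[s][g] for g in genes] for s in samples}
    dist = {(a, b): 1.0 - pearson(profile[a], profile[b])
            for a in samples for b in samples if a != b}

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        return mean(dist[(a, b)] for a in c1 for b in c2)

    clusters = [[s] for s in samples]
    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy data (hypothetical): 6 samples, 200 genes, different sequencing depths.
# Genes 0-9 are 8x higher in the "A" samples, genes 10-19 are 8x higher in "B".
rng = random.Random(0)
n_genes = 200
base = [rng.uniform(50, 500) for _ in range(n_genes)]
depth = {"A1": 1.0, "A2": 2.0, "A3": 0.5, "B1": 1.0, "B2": 2.0, "B3": 0.5}
counts = {}
for sample, d in depth.items():
    row = []
    for g in range(n_genes):
        up_in_a = g < 10 and sample.startswith("A")
        up_in_b = 10 <= g < 20 and sample.startswith("B")
        fold = 8.0 if (up_in_a or up_in_b) else 1.0
        row.append(round(base[g] * fold * d * rng.uniform(0.8, 1.2)))
    counts[sample] = row

expr = log_cpm(counts)
genes = top_variable_genes(expr, n_top=20)
clusters = average_linkage(expr, genes, k=2)
print(clusters)
```

One thing the toy data is deliberately built to respect: Pearson correlation ignores a constant shift, so if every selected gene moved in the same direction in one group, correlation between sample profiles would barely change. That is one reason to look at the selected genes, and at known covariates such as site or tissue source, rather than trusting the clusters blindly.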