Clustering RNA Seq - No Conditions, No Replicates
9.7 years ago
mbio.kyle ▴ 380

I have 25 samples from the TCGA, which contain RNA sequencing expression data from 25 different clinical cancer tumor biopsies. I want to cluster them based on similar expression. The problem is that there are no conditions or replicates from which to build an experimental design to feed into DESeq or edgeR. I also tried things like Perl's KMeans library, R's built-in kmeans(), etc. The problem is that for each sample I have 25K expression values (25K genes), so the feature vectors are very large and I don't think I am getting anything useful.

Does anyone have any advice on clustering data sets with very large feature vectors and/or clustering expression data without biological conditions?

Thanks,

Kyle

RNA RNA-Seq R • 3.6k views
9.7 years ago
Ahill ★ 2.0k

I'd suggest that you first have a hypothesis. Given what you know about the TCGA biopsies and where they came from, what's your expectation of what a clustering would look like? To get a sense of the data, first make sure the counts are suitably normalized. You could then filter the genes down to the subset with the highest variance across samples on the log scale and start clustering with a small set of those genes, perhaps a few hundred or a thousand, something that is easy to visualize. I would start with correlation as the similarity measure. Do you know if the samples came from different clinical sites? Are those sites reflected in the initial clusters? Or the biopsy tissue source? Once you have an initial picture based on the most variable genes, depending on what you find, you may wish to expand out to include more of the genes, to see what new clusters emerge, if any. Be aware that low-count mRNAs may contribute more noise than signal to your clustering.
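A minimal sketch of the workflow described above, in plain Python so it runs without extra packages. Everything specific here is an assumption for illustration rather than something stated in the thread: the normalization is log2 of counts-per-million plus one, the linkage is average linkage, the toy count table and gene names are invented, and the function names (`cpm_log2`, `top_variable_genes`, `correlation_distance`, `average_linkage`) are hypothetical. In practice the same steps would more likely be done in R on the real count table.

```python
import math

def cpm_log2(counts):
    """counts: dict gene -> list of raw counts, one per sample.
    Returns log2(CPM + 1); assumes every sample has a non-zero library size."""
    n = len(next(iter(counts.values())))
    lib = [sum(v[j] for v in counts.values()) for j in range(n)]
    return {g: [math.log2(v[j] / lib[j] * 1e6 + 1) for j in range(n)]
            for g, v in counts.items()}

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def top_variable_genes(logexpr, n_top):
    """Keep the n_top genes with the highest variance across samples."""
    return sorted(logexpr, key=lambda g: variance(logexpr[g]), reverse=True)[:n_top]

def pearson(a, b):
    # Assumes neither profile is constant (non-zero standard deviation).
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def correlation_distance(logexpr, genes):
    """Sample-by-sample matrix of 1 - Pearson correlation over the chosen genes."""
    n = len(logexpr[genes[0]])
    profiles = [[logexpr[g][j] for g in genes] for j in range(n)]
    return [[1 - pearson(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]

def average_linkage(dist, k):
    """Naive agglomerative clustering: merge the closest pair until k clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Invented toy table: 5 genes x 4 samples (not real TCGA data).
counts = {
    "geneA": [100, 110, 10, 12],
    "geneB": [10, 9, 120, 100],
    "geneC": [50, 52, 49, 51],
    "geneD": [200, 190, 20, 25],
    "geneE": [5, 6, 80, 75],
}
logexpr = cpm_log2(counts)
top = top_variable_genes(logexpr, n_top=4)
dist = correlation_distance(logexpr, top)
clusters = average_linkage(dist, k=2)
print(top)
print(clusters)
```

With a real table, `n_top` would be a few hundred to a thousand as suggested above, and the resulting clusters could then be compared against clinical site or biopsy tissue source.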


Thank you for your response. I have two follow up questions.

1. Should I be working with the RPKM values or with counts? (I am using counts, since I initially built a count table for DESeq.)

2. Should I just use common sense for cutoffs (what is low-count, what is low-variance), or are there standards that people use?

Otherwise, I am paring down the list and moving forward with your suggestions. I feel a lot more sane working with a couple of thousand genes.

Thanks again!
