Question

How to know the ideal number of clusters for my RNASeq data?

0


4.7 years ago

fcamus • 0

Hello, I am using the package MBCluster.Seq to cluster my RNA-Seq data. I have to enter the number of clusters manually, but how do I know what the ideal number is? I've looked at some methods and I roughly understand the logic behind choosing the right number, but is there an R package I can use to test it on my data?

Thanks!

RNA-Seq cluster • 1.3k views

updated 4.7 years ago by ATpoint 86k • written 4.7 years ago by fcamus • 0

2


See this link. Search for "Determining Optimal Clusters" in the PDF. The tutorial describes three such methods and includes the R code for them as well.

4.7 years ago by ashish ▴ 680
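As a rough illustration of the kind of check the reply above points to, here is a minimal sketch of two common heuristics: the elbow (total within-cluster sum of squares) and the average silhouette width. It is not taken from the linked tutorial, and it uses k-means on a simulated, log-scale-like matrix rather than the count-based model in MBCluster.Seq; the matrix `expr` and its three planted gene groups are made up for the example, so swap in your own normalized, transformed expression values.

```r
library(cluster)  # provides silhouette(); a recommended package bundled with most R installs

# ASSUMPTION: simulated stand-in for a normalized, log-scale expression matrix
# (rows = genes, columns = samples) with three planted groups of 50 genes each.
set.seed(42)
profiles <- rbind(c(3, 3, 0, 0, 0, 0),
                  c(0, 0, 3, 3, 0, 0),
                  c(0, 0, 0, 0, 3, 3))
genes_per_group <- 50
expr <- do.call(rbind, lapply(seq_len(nrow(profiles)), function(i) {
  base <- matrix(rep(profiles[i, ], each = genes_per_group), nrow = genes_per_group)
  base + matrix(rnorm(genes_per_group * ncol(profiles), sd = 0.5), nrow = genes_per_group)
}))

ks <- 2:8          # candidate numbers of clusters
d  <- dist(expr)   # Euclidean distances between genes, reused for every k

# Fit k-means once per candidate k (several random starts to avoid poor local optima)
fits <- lapply(ks, function(k) kmeans(expr, centers = k, nstart = 25))

# Elbow criterion: total within-cluster sum of squares (look for the bend)
wss <- sapply(fits, function(f) f$tot.withinss)

# Silhouette criterion: mean silhouette width (higher is better)
avg_sil <- sapply(fits, function(f) mean(silhouette(f$cluster, d)[, "sil_width"]))

best_k <- ks[which.max(avg_sil)]
print(data.frame(k = ks, wss = round(wss, 1), avg_sil = round(avg_sil, 3)))
```

On real data the two criteria often disagree or show no clear optimum, so treat `best_k` as a starting point rather than an answer. The same `cluster` package also provides `clusGap()` for the gap statistic.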

score 1 · Answer 1 · 2020-04-08

There is no gold standard for this. I personally prefer to plot the entire heatmap first and then look at the data. If the tool forces you to enter a value, start with 1 or 2. From there on it comes down to your interpretation. If visual inspection suggests that more clusters are present, go higher; if you think the new clusters are nonsense and split up larger clusters that looked reasonable, go lower.

Very small clusters are probably not biologically meaningful, as they lack the power to show enrichment for any pathway, unless the genes in them are super important. If you have a cluster of 5 genes but these 5 genes are JAK1, JAK2, and three STAT genes, then you probably have quite a meaningful cluster related to JAK/STAT signaling, and it should demand your attention. Since results must make biological sense, I strongly recommend ranking biological evidence over statistical thresholds.

Disclaimer: I am a molecular biologist by training who then entered bioinformatics to analyze my own data. A statistician might tell you to go for statistical evidence (that is, a model-driven choice of cluster number) rather than trusting your personal interpretation, since personal interpretation is biased and statistics much less so. The point is also that these statistical frameworks weight all genes equally in terms of biological meaning, which is not biologically true. Take the 5 genes from above: having them strongly overexpressed and in one cluster is a strong biological phenotype, whereas the same pattern in a couple of genes with a modest modulatory, but not driving, effect is not, at least imho.

I also change cluster numbers and algorithms until I think I have found the most biologically meaningful result. Again, this is of course based on my personal standpoint and interpretation, so there is no guarantee that it is meaningful.
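The "try several cluster numbers and look at what comes out" workflow described above could be sketched roughly as follows. This is only one way to do it, using base R hierarchical clustering instead of MBCluster.Seq. The matrix `expr`, the planted 5-gene module, and the helper `small_cluster_genes()` are all hypothetical names invented for the example.

```r
# ASSUMPTION: simulated stand-in for a normalized, log-scale expression matrix.
set.seed(1)
n_genes <- 200
expr <- matrix(rnorm(n_genes * 6), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)),
                               paste0("sample", 1:6)))
# Plant a small, strongly co-regulated module in the first 5 genes
expr[1:5, ] <- expr[1:5, ] + rep(c(6, 6, 6, -6, -6, -6), each = 5)

# Z-score each gene across samples so clustering follows the pattern, not the level
z  <- t(scale(t(expr)))
hc <- hclust(dist(z), method = "average")

# Visual inspection first (uncomment in an interactive session):
# heatmap(z, Rowv = as.dendrogram(hc))

# Cut the same tree at several k and compare the cluster sizes
ks <- 2:6
sizes_by_k <- lapply(ks, function(k) table(cutree(hc, k = k)))
names(sizes_by_k) <- ks
print(sizes_by_k)

# Hypothetical helper: list the genes in clusters of at most `max_size` members,
# so they can be checked by name before being dismissed as noise.
small_cluster_genes <- function(hc, k, max_size = 5) {
  cl <- cutree(hc, k = k)
  sizes <- table(cl)
  small_ids <- as.integer(names(sizes)[sizes <= max_size])
  keep <- cl %in% small_ids
  split(names(cl)[keep], cl[keep])
}
print(small_cluster_genes(hc, k = 6))
```

Whether the planted module ends up in its own cluster depends on the chosen k, distance, and linkage. The point of the helper is only to make small clusters easy to read by gene name, so that a JAK/STAT-like group is not discarded on size alone.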