I want to cluster genes based on an expression matrix and annotation data. With unsupervised learning algorithms (hierarchical clustering, k-means, ...), the clusters are based only on the correlation of gene expression. My idea was to supervise the learning with annotation data, which could lead to more meaningful clusters (depending on the annotation input). Random Forest has been used on expression data, but I only found examples with two classes, both of which were included in the training set. I also want to find new clusters that did not exist in the training set. So the algorithm would be trained with an expression matrix (features), some annotations (features) and many clusters of different sizes (class labels). On the test data, the algorithm should decide which genes belong together in a cluster. It would find clusters that already existed in the training set as well as new clusters. Is Random Forest the right algorithm for that?
I use random forest quite a bit, and I spent some time trying to understand what you are doing here.
It just isn't clear. If you could describe your pipeline as a series of steps (1. xxx, 2. yyy, and so on), it might be easier to understand.
I also don't see why you want to combine 'annotation data' with gene expression data. Is it clinical data? It is probably very low-dimensional compared to the gene expression data.
What I understand from this is that you're trying to do classification with a non-exhaustive training set (i.e. some classes are unknown at training time). This is an interesting machine learning problem, but it doesn't have a simple solution. The standard approach in this situation is indeed to use unsupervised methods, i.e. clustering algorithms. However, another way could be to use a supervised approach to identify all items belonging to the known classes and deal with the remaining ones separately. Finally, you could also look into a Bayesian approach such as in this paper.
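To make the "supervised first, then handle the rest separately" idea concrete, here is a minimal sketch in plain Python. It is not a random forest: to stay dependency-free it swaps in a nearest-centroid classifier with a rejection threshold, and then groups the rejected genes with a simple greedy clustering. The toy profiles, the labels "rising" and "falling", the function names and the threshold of 0.2 are all assumptions made up for illustration, and only expression values are used as features.

```python
import math
from statistics import mean

def pearson_distance(a, b):
    """Return 1 - Pearson correlation between two expression profiles."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    if sa == 0 or sb == 0:
        return 1.0  # flat profile: treat as uncorrelated
    return 1.0 - cov / (sa * sb)

def fit_centroids(profiles, labels):
    """Average the training profiles of each known cluster."""
    groups = {}
    for profile, label in zip(profiles, labels):
        groups.setdefault(label, []).append(profile)
    return {label: [mean(col) for col in zip(*members)]
            for label, members in groups.items()}

def assign_or_reject(profile, centroids, max_dist):
    """Return the closest known label, or None if nothing is close enough."""
    label, dist = min(
        ((lab, pearson_distance(profile, c)) for lab, c in centroids.items()),
        key=lambda pair: pair[1],
    )
    return label if dist <= max_dist else None

def cluster_leftovers(profiles, max_dist):
    """Greedy clustering: join the first cluster whose seed profile is
    within max_dist, otherwise start a new cluster."""
    seeds, assignment = [], []
    for profile in profiles:
        for i, seed in enumerate(seeds):
            if pearson_distance(profile, seed) <= max_dist:
                assignment.append(i)
                break
        else:
            seeds.append(profile)
            assignment.append(len(seeds) - 1)
    return assignment

# Toy training set: two known clusters (hypothetical data).
train_profiles = [[1, 2, 3, 4], [2, 3, 4, 6], [4, 3, 2, 1], [6, 4, 3, 2]]
train_labels = ["rising", "rising", "falling", "falling"]

# Toy test set: two genes from known clusters, three from unseen patterns.
test_profiles = [[1, 2, 3, 5], [5, 4, 2, 1],
                 [1, 5, 1, 5], [2, 6, 2, 7], [3, 3, 9, 3]]

MAX_DIST = 0.2  # arbitrary rejection threshold, would need tuning on real data

centroids = fit_centroids(train_profiles, train_labels)
known_labels = [assign_or_reject(p, centroids, MAX_DIST) for p in test_profiles]

# Genes rejected by the supervised step are clustered among themselves.
leftovers = [p for p, lab in zip(test_profiles, known_labels) if lab is None]
new_clusters = cluster_leftovers(leftovers, MAX_DIST)

print(known_labels)
print(new_clusters)
```

The same two-stage structure should carry over to a random forest: a prediction whose class probabilities are all low would play the role of the distance threshold here, and any clustering method could replace the greedy step. How well such a rejection rule separates truly new clusters from noisy members of known ones is something to check on real data.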