I have a matrix with proteins in the rows and conditions in the columns. The values are relative changes in protein abundance compared to a standard condition, e.g. 1 -> no change; 1.5 -> 50% more; 0.5 -> 50% less. The values range from 0 up to 6000, but most of them are close to 1. The matrix can also be log2-transformed to obtain normally distributed data.
My goal is to find clusters of proteins that show similar expression behavior across the conditions. I am working with Python and the clustering algorithms in scikit-learn. First I tried k-means. I had to log2-transform the data to reduce the influence of the outliers, but I still get clusters that are separated by the magnitude of their change values, not by the shape of their profiles. In the example pictures you can see that proteins that move around 0 (or 1 without log2) are clustered together, but proteins with higher fold changes are separated from the others. https://pl.vc/pjfik / https://pl.vc/8dggb
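A minimal sketch of this kind of k-means setup, with synthetic data standing in for the real matrix. The array names, the matrix size (60 proteins x 8 conditions) and the number of clusters are assumptions made up for illustration, and the synthetic values are strictly positive; real data containing zeros would need extra handling before the log2 step.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical data: 3 underlying profile shapes (in log2 space), each shared
# by 20 proteins but with a different amplitude per protein.
shapes = rng.normal(0.0, 1.0, size=(3, 8))
amplitudes = rng.uniform(0.2, 3.0, size=(60, 1))
noise = rng.normal(0.0, 0.1, size=(60, 8))
fold_changes = 2.0 ** (np.repeat(shapes, 20, axis=0) * amplitudes + noise)

# log2-transform so that "no change" (1) maps to 0 and up/down changes
# become symmetric (1.5 -> ~0.58, 0.5 -> -1).
# Assumption: no zeros in the matrix; log2(0) would give -inf.
log2_fc = np.log2(fold_changes)

# Plain k-means on the log2 values uses Euclidean distance, so proteins with
# the same shape but different amplitude can still end up far apart.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(log2_fc)

print(np.bincount(labels))  # cluster sizes
```

Because the distance here is Euclidean, the amplitude of a profile contributes to the distance just as much as its shape does, which matches the behavior described above.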
Then I tried agglomerative clustering. Here I also had to log2-transform the data to reduce the separation of the outliers. I used the "cosine" metric (http://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_clustering_metrics.html#example-cluster-plot-agglomerative-clustering-metrics-py), because, as that example states, "The cosine distance is invariant to a scaling of the data". The clusters look better, but there are still huge clusters with values around 0 (or 1 without log2). https://pl.vc/q0ikc / https://pl.vc/1jh54b
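A minimal sketch of this kind of agglomerative setup, again on synthetic log2 data. The data, the cluster count and the choice of average linkage are assumptions for illustration (Ward linkage cannot be combined with a cosine metric). The keyword that selects the distance was renamed from `affinity` to `metric` in newer scikit-learn releases, so the sketch picks whichever one the installed version accepts.

```python
import inspect

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_distances

rng = np.random.default_rng(1)

# Hypothetical log2 fold changes: 3 profile shapes, 20 proteins each,
# with a different amplitude per protein plus a little noise.
shapes = rng.normal(0.0, 1.0, size=(3, 8))
amplitudes = rng.uniform(0.2, 3.0, size=(60, 1))
log2_fc = np.repeat(shapes, 20, axis=0) * amplitudes
log2_fc += rng.normal(0.0, 0.05, size=(60, 8))

# Cosine distance ignores positive scaling: a profile and a 3x stronger
# copy of it are at distance ~0.
scaled_dist = cosine_distances(shapes[:1], 3.0 * shapes[:1])[0, 0]

# Older scikit-learn versions call the parameter "affinity", newer "metric".
params = inspect.signature(AgglomerativeClustering.__init__).parameters
metric_kw = "metric" if "metric" in params else "affinity"

model = AgglomerativeClustering(
    n_clusters=3,
    linkage="average",  # "ward" only works with Euclidean distances
    **{metric_kw: "cosine"},
)
labels = model.fit_predict(log2_fc)

print(np.bincount(labels))  # cluster sizes
```

One thing to keep in mind with this metric: profiles that stay close to 0 in log2 space consist mostly of noise, so their direction, and therefore their cosine distance to other profiles, is poorly defined.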
Are there clustering algorithms designed specifically for this, perhaps developed with biological data in mind? Extra question: I thought about using biclustering algorithms. Which one would be suitable, and are there implementations for Python, R, or similar?