I'm currently using Algorithm::Cluster, which is based on the C Clustering Library, to cluster sequences and structures in Perl. Algorithm::Cluster provides many clustering facilities, including hierarchical clustering. Given the desired number of clusters, it builds the tree and cuts it. What I need, however, is a library that allows for a threshold. Something like: all of the members of one cluster are <= X distance apart, or: any two members of different clusters are >= X distance apart.
Is this possible in Algorithm::Cluster? Or is there another (Perl) module that would, given a distance matrix and a threshold, determine the appropriate number of clusters and their members?
Those interested in a quick Pure Perl solution can use this example which uses some undocumented XS interfaces:
    use strict;
    use warnings;
    use feature 'say';

    sub cutthresh {
        my ($tree, $thresh) = @_;
        my @nodecluster;
        my @leafcluster;
        # Binary tree: number of internal nodes is 1 less than # of leafs
        # Last node is the root, walking down the tree
        my $icluster = 0;
        # The tree length is the number of internal nodes; the root is the last one
        my $root = $tree->length - 1;
        # Root node belongs to cluster 0
        $nodecluster[$root] = $icluster++;
        for (my $i = $root; $i >= 0; $i--) {
            my $node = $tree->get($i);
            say sprintf "%3d %3d %.3f", $i, $nodecluster[$i], $node->distance;
            my $left = $node->left;
            # Nodes are numbered -1,-2,... Leafs are numbered 0,1,2,...
            my $leftref = $left < 0 ? \$nodecluster[-$left-1] : \$leafcluster[$left];
            my $assigncluster = $nodecluster[$i];
            # Left is always the same as the parent node's cluster
            $$leftref = $assigncluster;
            say sprintf "\tleft %3d %3d", $left, $$leftref;
            my $right = $node->right;
            # Put right into a new cluster, when thresh not satisfied
            if ($node->distance > $thresh) { $assigncluster = $icluster++ }
            my $rightref = $right < 0 ? \$nodecluster[-$right-1] : \$leafcluster[$right];
            $$rightref = $assigncluster;
            say sprintf "\tright %3d %3d", $right, $$rightref;
        }
        return @leafcluster;
    }
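For comparison, the second criterion from the question (any two members of different clusters are >= X apart) can be checked without building a tree at all. The following is only a minimal sketch, assuming a full symmetric distance matrix; the names threshold_clusters and @dist are made up for illustration and are not part of Algorithm::Cluster. It merges every pair closer than the threshold using a small union-find, which is equivalent to cutting a single-linkage tree at X.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Algorithm::Cluster): group items so that
# any two items in different clusters are at least $thresh apart.
# $dist is a reference to a full symmetric distance matrix.
sub threshold_clusters {
    my ($dist, $thresh) = @_;
    my $n = @$dist;
    my @parent = (0 .. $n - 1);    # each item starts as its own group

    # Find the representative of an item's group, shortening the path as we go
    my $find = sub {
        my ($i) = @_;
        while ($parent[$i] != $i) {
            $parent[$i] = $parent[$parent[$i]];
            $i = $parent[$i];
        }
        return $i;
    };

    # Merge the groups of every pair that is closer than the threshold
    for my $i (0 .. $n - 1) {
        for my $j ($i + 1 .. $n - 1) {
            next if $dist->[$i][$j] >= $thresh;
            my ($ri, $rj) = ($find->($i), $find->($j));
            $parent[$rj] = $ri if $ri != $rj;
        }
    }

    # Number the clusters 0,1,2,... in order of first appearance
    my (%id, @cluster);
    my $next_id = 0;
    for my $i (0 .. $n - 1) {
        my $root = $find->($i);
        $id{$root} = $next_id++ unless exists $id{$root};
        push @cluster, $id{$root};
    }
    return @cluster;
}

# Toy distance matrix for four items (illustrative values only)
my @dist = (
    [0, 1, 6, 7],
    [1, 0, 5, 8],
    [6, 5, 0, 2],
    [7, 8, 2, 0],
);
my @cluster = threshold_clusters(\@dist, 3);
print "@cluster\n";    # prints "0 0 1 1"
```

Note that this only guarantees the separation between clusters. Members of the same cluster can still end up more than X apart through chaining, so the first criterion (all members of one cluster <= X apart) needs complete linkage instead.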
Do you have to use Perl? I am a huge fan of Perl, but for this sort of task I would use R. I have used this website to learn clustering in R. You can still use Perl to connect with R if you have to. I played with RSPerl for a while, but in the end it was easier for me to just use R scripts.
The pure Perl version of this has now been implemented as http://p3rl.org/Algorithm::Cluster::Thresh for those who are interested.