Hi,
Some of you may know aout the Neighbor-joining (NJ) algorithm, which, given a set of taxa and a distance matrix, essentially does the following :
- find the best two taxa A,B to join under a common parent P
- update the distances from all taxa to P
- repeat until a binary tree is obtained
This works if we're given all the taxa at first. Each time we form a taxa cluster (i.e. a subtree), the other distances to this subtree are updated and everything goes well.
Now, I'm given a set of fixed taxa clusters and I'm asked to join them in an order as what NJ would do.
For each cluster, I have access to the set of taxa they contain (i.e. the leaves), but not to their topology.
Thus, I can't really compute the distance between two clusters like NJ, because NJ updates the distances at each
joining, iteratively. Since I'm given the clusters without the topology, I don't know in which order the joins occurred.
So my question is : given a set of taxa clusters T for which I only know of the underlying taxa, and given a taxa distance-matrix, is there some way I can know of the best two clusters of T to join according to NJ ?
Can't you calculate the pairwise distances between all the clusters and join those with the least distance first, then recalculate the distances between the newly formed cluster and the old ones, find the two with the least distance, connect ... and so on ?
Yes, actually, that is my problem. If I can calculate the distance between any pair of clusters, problem solved. But as I understand it, all the distances in NJ get updated each time we join two vertices together. Therefore, the distance between two clusters seems to depend on the order in which we have previously joined vertices together, which I don't have access to. If there's a way to get the distance between two clusters in NJ only knowing the distances between the leaves, I'm interested. Otherwise, I think I'll try to get access to the topology of my clusters, and sort of reproduce the NJ algorithm on all the subtrees to get updated distances.