I have been looking into how the neighbor joining algorithm works and I have two questions about the first step, i.e. computing the matrix Q using the following formula:
Q(i,j) = (n - 2)d(i,j) - Σd(i,k) - Σd(j,k)
Where both summations are from k = 1 to n, n is the number of taxons, d() is the distance between two elements and i and j are specific taxons.
From my understanding this formula creates a new matrix containing the distance between leaves i and j multiplied by (n-2). From this value we subtract the total distance from i to all other nodes and the total distance from j to all other nodes. This gives us a sense of how far away any given pair of nodes is from all other clusters.
I think we are using the term (n-2) as a scaling factor to ensure that the value calculated by the two sums does not dominate the result. If this is the case, why is it n-2 and not n-1 or just n?
Finally, I have also encountered the following formula for Q:
Q(i,j) = d(i,j) - 1/(n-2)Σd(i,k) - 1/(n-2)Σd(j,k)
Are these two formulas equivalent? If so why?
Thank you.
This is not a scaling factor, it comes from doing the sum over all leaves except i and j.