Robinsons Foulds Example Calculation
4
3
Entering edit mode
11.4 years ago
amod47463 ▴ 30

Hi,

I'm just trying to work out quickly on paper the calculations behind the robinson foulds distance algorithm (not the symmetric distance) but rather the definition below from dendropy:

"This method returns the Robinsons-Foulds distance between two trees, i.e., the sum of the square of differences in branch lengths for equivalent splits between two trees, with the branch length for a missing split taken to be 0.0."

Assuming I had the two trees:

s1 = "((t5:0.161175,t6:0.161175):0.392293,((t4:0.104381,(t2:0.075411,t1:0.075411):0.028969):0.065840,t3:0.170221):0.383247)"
s2 = "((t5:2.161175,t6:0.161175):0.392293,((t4:0.104381,(t2:0.075411,t1:0.075411):1):0.065840,t3:0.170221):0.383247)"

So the clusters are:

(t1,t2)
(t1, t2, t3, t4)
(t1, t2, t3, t4, t5, t6)
(t1, t2, t4)
(t5, t6)

The branch lengths are the same in both trees with the exception of the 2.161175 (tree 2) and 0.161175 in tree1 (occurring twice). I already know from using the dendropy package that the correct answer is 2.971031.

However from the above input data and the two input trees, I would have thought that all equivalent splits end up producing a value of zero and 2.161175 - 0.161175 squared (occurring twice) would give a value 8.

I'm clearly incorrect but am not sure what I'm misunderstanding (maybe its my understanding of 'equivalent' splits). A worked example would be really helpful so any help would be appreciated.

Many thanks in advance.

phylogenetics distance • 8.8k views
ADD COMMENT
2
Entering edit mode
11.4 years ago
Joseph Hughes ★ 3.0k

According to Splitstree, these are the different splits for your two trees:

Tree S1:

0.161175 Split t5,

0.161175 Split t5 t4 t2 t1 t3,

0.104381 Split t5 t6 t2 t1 t3,

0.075411 Split t5 t6 t4 t1 t3,

0.075411 Split t5 t6 t4 t2 t3,

0.028969 Split t5 t6 t4 t3,

0.06584 Split t5 t6 t3,

0.170221 Split t5 t6 t4 t2 t1,

0.77554 Split t5 t6,

Tree S2:

2.161175 Split t5,

0.161175 Split t5 t4 t2 t1 t3,

0.104381 Split t5 t6 t2 t1 t3,

0.075411 Split t5 t6 t4 t1 t3,

0.075411 Split t5 t6 t4 t2 t3,

1.0 Split t5 t6 t4 t3,

0.06584 Split t5 t6 t3,

0.170221 Split t5 t6 t4 t2 t1,

0.77554 Split t5 t6,

So Split t5, and Split t5 t6 t4 t3, are different. Calculate the absolute difference:

Split t5 => 2.161175 - 0.161175 = 2

Split t5 t6 t4 t3 => 1 - 0.028969 = 0.971031

Sum = 2.971031

ADD COMMENT
0
Entering edit mode

Originally posted as an answer by amod47463; deleted and pasted here as a comment

Hi Joseph,

Thanks for the reply. Just to clarify: Do the workings above refer to the "euclidean distance" which is defined on the dendropy website (http://pythonhosted.org/DendroPy/tutorial/treestats.html) as:

"The sum of absolute differences in branch lengths for equivalent splits between two trees, with the branch length for a missing split taken to be 0.0."

Would you say this is a correct definition of the above calculation (it seems to be)? In the dendropy example however they give an answer of 2.2232636377544162 for euclidean distance.

On the other hand they define the Robinson Foulds algorithm as:

"the sum of the square of differences in branch lengths for equivalent splits between two trees, with the branch length for a missing split taken to be 0.0."

I would take this (based on the above) to be:

Split t5 => 2.161175 - 0.161175 = 2^2 = 4

Split t5 t6 t4 t3 => 1 - 0.028969 = 0.971031^2 = 0.94290120

Sum = 4.94290120

Any further clarification greatly appreciated.

ADD REPLY
0
Entering edit mode

The value of 2.971031 corresponds to absolute differences. The value of 2.223264 (i.e. sqrt(4.94290120)) corresponds to the euclidean distance as explained by David Winter. It would be great if you got in touch with the Dendropy developers to ask them to make the relevant clarifications.

ADD REPLY
1
Entering edit mode
11.4 years ago
ross.mounce ▴ 10

Robinson-Foulds distance, as I understand it has nothing to do with branch lengths. It's a topology comparison _only_ statistic. It ignores branch length. http://en.wikipedia.org/wiki/Robinson%E2%80%93Foulds_metric

ADD COMMENT
0
Entering edit mode
11.4 years ago
David W 4.9k

I think you've found a disconnect between the docs and the code here, they may say Dendropy is using euclidean distances for weighted RF, but the results it returns are based on absolute differences.

This is easy to show in the test case (but would be harder if the edges didn't line up so nicely) using R:

library(ape)
tr1 <- read.tree(text= "((t5:0.161175,t6:0.161175):0.392293,((t4:0.104381,(t2:0.075411,t1:0.075411):0.028969):0.065840,t3:0.170221):0.383247);")
tr2 <- read.tree(text= "((t5:2.161175,t6:0.161175):0.392293,((t4:0.104381,(t2:0.075411,t1:0.075411):1):0.065840,t3:0.170221):0.383247);")

dist(rbind(tr1$edge.length, tr2$edge.length), method="euclidean")
##         1
## 2 2.223264

dist(rbind(tr1$edge.length, tr2$edge.length), method="manhattan")
##1
##2 2.971031

The only problem is that I have no idea which of the these is _real_ RF-weighted distance! The paper is here but behind a pay wall for me. As Ross suggests, when people talk about Robinson-Foulds distance they usually mean the topology-only version, so it's hard to find other implementations.

Certainly something to email the Dendropy devs about, anyway.

EDIT

I think I was thrown off my the use of the word "euclidean" in comments. The value you've correctly calculated is the sum of squared differences,

sum(abs(tr1$edge.length- tr2$edge.length)^2)
## 4.942901

which is different from the euclidean distance, which is the square root of the sum of squared differences:

sqrt(sum(abs(tr1$edge.length- tr2$edge.length)^2))
## 2.223264
ADD COMMENT
0
Entering edit mode

Hi David,Joseph

Thanks a million for the replies. They're very helpful. There seem to many many robinson foulds definitions! I will indeed contact the dendropy developers.

On a separate note could the logic behind the splits please be explained? My reasoning for listing the splits according to the below (as per my original post) was based on reading about the tree clusters/clades and subclades.

(t1,t2) (t1, t2, t3, t4) (t1, t2, t3, t4, t5, t6) (t1, t2, t4) (t5, t6)

While it's clear the splits generated from splitstree software are correct (as the subsequent values do indeed calculate the correct distance values), I currently don't understand where they come from.

Thanks for the help so far.

ADD REPLY
1
Entering edit mode

You will find a lot of information in the Splitstree manual. Essentially, the deletion of any branch (or edge) will give rise to a split. In your example you have 9 branches so nine splits and the splits can be defined in two ways. For example split t5 could be written as t6 t3 t4 t2 t1. Split t5 corresponds to the deletion of the branch leading to t5, split t5 t4 t2 t1 t3 corresponds to the deletion of the branch leading to t6. I will leave it to you to work out the rest.

ADD REPLY
0
Entering edit mode

Hi Joseph,

Many thanks for the reply. Apologies for resurrecting an old thread but we have been working at understanding this:

So from the tree below:

tree="((t5:2.161175,t6:0.161175):0.392293,((t4:0.104381,(t2:0.075411,t1:0.075411):1):0.065840,t3:0.170221):0.383247)"

We understand the splits to be as follows:

T5, T6, T4, T2, T1, T3, split T4,2,1, split T2,1, Split 4,3,2,1, i.e. 9 splits in total

From the above newick tree the distances can be found as follows:

2.161175 = split 5

0.161175 = split 6

0.104381 = split 4

0.075411 = split 2
0.075411 = split 1
0.06584 = split 4,2,1   
1.0 Split = split 2,1
??? = split 4,3,2,1

Using the program "splittree", the following results are obtained:

2.161175 Split t5,

0.161175 Split t5 t4 t2 t1 t3,

0.104381 Split t5 t6 t2 t1 t3,

0.075411 Split t5 t6 t4 t1 t3,

0.075411 Split t5 t6 t4 t2 t3,

1.0 Split t5 t6 t4 t3,

0.06584 Split t5 t6 t3,

0.170221 Split t5 t6 t4 t2 t1,

0.77554 Split t5 t6

Our outstanding questions are:

  • Firstly, why are the splits named as such (opposite) by splittree, in particular Split T5 is different to the rest?
  • Secondly, How is the value .77554 obtained for the final split? It appears to be 0.383247 + 0.392293, however we are unsure of the logic behind this?

Thanks in advance

ADD REPLY
0
Entering edit mode

All fully bifurcating trees can be represented as a binary matrix where 1s show the tips that subtend from each node. I think (but best check with the developers of Splitstree) that the splits are based on this. The matrix for the above tree looks like this:

    t5 111111111
    t6 001111111
    t4 010111010
    t2 011010010
    t1 011100010
    t3 011111100

So the splits are as you describe. The binary matrix is worked out based on the first taxon in the tree file. So you could rewrite the tree as:

(((t4:0.104381,(t2:0.075411,t1:0.075411):1):0.065840,t3:0.170221):0.383247,(t5:2.161175,t6:0.161175):0.392293)

and you would get the binary matrix:

t4 111111111
t2 001011111
t1 010011111
t3 011100111
t5 011101010
t6 011101100

and the following splits

t4                   0.104381
t4,t1,t3,t5,t6       0.075411
t4,t2,t3,t5,t6       0.075411
t4,t2,t3,t5,t6       1.0
t4,t2,t1             0.06584
t4,t2,t1,t5,t6       0.170221
t4,t2,t1,t3,t6       2.161175
t4,t2,t1,t3,t5       0.161175
t4,t2,t1,t3          0.77554

I hope this helps.

ADD REPLY

Login before adding your answer.

Traffic: 1628 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6