Question

Data structure for phylogeny data

0

Entering edit mode

10.3 years ago

Quak ▴ 520

I would like to store/represent a phylogeny tree data, in a way that, I can trace back and see,

1) which node is the ancestor of a give node; 2) which nodes are descendants of a give node;

I have made a distance matrix similar to the following sketch

for example to say that H is an ancestor of A; and I and C are descendants of L

Is there any common data structure for such things?

data-structure phylogeny • 2.8k views

ADD COMMENT • link updated 3.0 years ago by Ram 44k • written 10.3 years ago by Quak ▴ 520

0

Entering edit mode

I don't understand the question, what is wrong with the representing the tree as a series of vertices that have a reference to their child nodes and to their parent node. To know the ancestor, take a node and. while you don't reach the root, go to the parent node recursively. For subtrees, just use recursion on the child nodes. while you do not reach a leaf.

Maybe I misunderstood the question.

ADD REPLY • link 10.3 years ago by Gabriel R. ★ 2.9k

Ram · Answer 1 · 2014-08-24

0

Entering edit mode

10.3 years ago

pld 5.1k

Yes, you have an image of one in your post: a tree. The distance matrix in your post (almost) implicitly describes a tree, since it can be used to construct a directed acyclic graph, although it might take a bit more effort since a phylogenetic tree and a distance matrix don't exactly translate.

You could also use disjoint sets, but a tree is a more obvious approach.

ADD COMMENT • link updated 3.0 years ago by Ram 44k • written 10.3 years ago by pld 5.1k

0

Entering edit mode

excuse me for my ignorance - how can I address item 1 and 2 (I listed in the question) from the distance matrix ?

ADD REPLY • link 10.3 years ago by Quak ▴ 520

0

Entering edit mode

It won't be so easy, a phylogenetic tree has one edge per child, but your distance matrix has multiple edges per child. E.g. B is a child of K, but your distance matrix says that there is an edge {B,A}. This means that given a distance distance matrix, it may be possible to generate more than one tree. You'll have to do some work in figuring out which edges will be used for the tree.

This isn't exactly an easy problem, check out neighbor joining: http://en.wikipedia.org/wiki/Neighbour_joining.

What might be an easier solution is to obtain a text representation of the tree from your software and parse/process that. This text representation stores the information you want. Many programs that calculate phylogeny will store trees as text.

The easiest option is to just look at the tree, unless you're working with an enormous phylogeny, I think it'd be easiest to look at the tree. You never want to not look at the tree, you should always check the whole tree to make sure things are behaving in ways that make sense. As far as I know, Bayesian, Neighbor Joining and ML methods for tree construction are not guaranteed to converge on the correct tree.

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.3 years ago by pld 5.1k