reduce Newick tree and maintain greatest diversity in subset
2
2
Entering edit mode
9.1 years ago
voynich ▴ 20

I have several thousand sequences, which have been aligned and used to generate a Newick phylogeny tree. I want to find a smaller subset of sequences, X, such that the remaining sequences are still distributed uniformly across the phylogenic space.

So far I'm thinking about determining the number of branches at some distance from the root, until the number of branches is close to X. Then for each of these branches, I choose one tip sequence at random and delete the rest.

I am working with Python, but any pointers would be helpful. Thanks!

phylogeny tree • 2.4k views
ADD COMMENT
0
Entering edit mode

Did you find a solution? Would be very interested to hear how you achieved this!

ADD REPLY
1
Entering edit mode
9.1 years ago

This seems like it could be solved by a MST (minimum spanning tree)-like algorithm. Each iteration, find the shortest edge (distance between two leaves). Since that edge will join 2 nodes, you need to not only remove the edge, but discard one of the nodes, ideally the one closer their mutual next-closest node.

Iterate until you have the number of nodes you want. Storing edges in a heap (or priority queue) is often useful in this kind of scenario.

ADD COMMENT
1
Entering edit mode
9.1 years ago
Biojl ★ 1.7k

You can work it out with ete2: http://etetoolkit.org/

It's python based and will let you parse nodes and leaves and calculate all sort of measures, it has some very interesting built-in functions.

ADD COMMENT

Login before adding your answer.

Traffic: 2406 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6