Parsing A Tree File

Entering edit mode

13.8 years ago

User 1940 ▴ 80

Hi,

Is there any script template that parses a tree into its subtrees while showing all subchild / subparent nodes?

For example, for this following tree:

((ar, tk) 23 , (abc, ((xa, xb) 26 , ((lima, limb) 28 , (((spi, sfi) 31 , (eqc, (kme, ame) 33 ) 32 ) 30 , ((fxf, (tva, tvb) 36 ) 35 , (fxe, (cjk, ((mrb, (qq, rr) 41 ) 40 , sgn) 39 ) 38 ) 37 ) 34 ) 29 ) 27 ) 25 ) 24 ) 22 ;

I would like to have the following output:

node 23 includes 'ar' and 'tk',
node 32 includes nodes 'egc' and 33 which include both 'kme' and 'ame'
[...]

I can write scripts to capture nodes with two or three children, however, it gets messy when it comes to capturing the nodes at the end.

Cheers,

parsing tree phylogenetics • 8.5k views

ADD COMMENT • link updated 13.7 years ago by jhc ★ 3.0k • written 13.8 years ago by User 1940 ▴ 80

Entering edit mode

13.8 years ago

jhc ★ 3.0k

I would not rewrite the newick parser from scratch unless I had a very good reason. Have you tried any of tree manipulation libraries available (bio::phylo, ETE, etc.)?

Here it goes a very easy implementation using Python and ETE. You can customize the output as you wish.

from ete2 import Tree

def children_names(n):
    return ','.join([ch.name for ch in n.children])

nw = "((ar, tk) 23 , (abc, ((xa, xb) 26 , ((lima, limb) 28 , (((spi, sfi) 31 , (eqc, (kme, ame) 33 ) 32 ) 30 , ((fxf, (tva, tvb) 36 ) 35 , (fxe, (cjk, ((mrb, (qq, rr) 41 ) 40 , sgn) 39 ) 38 ) 37 ) 34 ) 29 ) 27 ) 25 ) 24 ) 22 ;"

tree = Tree(nw, format=1) # newick subformat 1 to read internal node names
for node in tree.traverse("postorder"):
    # A verbose output
    if not node.is_leaf() and node.up: # For all internal nodes
        print node.name, "whose parent is", node.up.name, \
        "has", len(node.children), "children named", children_names(node), \
        "and groups the following leaves:", ",".join(node.get_leaf_names())
    elif not node.is_leaf(): # In case we are at the root of the tree
        print node.name, "who is the root of the tree", \
        "has", len(node.children), "children named", children_names(node), \
        "and groups the following leaves:", ",".join(node.get_leaf_names())

I know this is not the most efficient way to browse the content of internal branches but, if your trees are not huge, few lines of code are more than enough.

23 whose parent is 22 has 2 children named ar,tk and groups the following leaves: ar,tk
26 whose parent is 25 has 2 children named xa,xb and groups the following leaves: xa,xb
28 whose parent is 27 has 2 children named lima,limb and groups the following leaves: lima,limb
31 whose parent is 30 has 2 children named spi,sfi and groups the following leaves: spi,sfi
33 whose parent is 32 has 2 children named kme,ame and groups the following leaves: kme,ame
32 whose parent is 30 has 2 children named eqc,33 and groups the following leaves: eqc,kme,ame
30 whose parent is 29 has 2 children named 31,32 and groups the following leaves: spi,sfi,eqc,kme,ame
36 whose parent is 35 has 2 children named tva,tvb and groups the following leaves: tva,tvb
35 whose parent is 34 has 2 children named fxf,36 and groups the following leaves: fxf,tva,tvb
41 whose parent is 40 has 2 children named qq,rr and groups the following leaves: qq,rr
40 whose parent is 39 has 2 children named mrb,41 and groups the following leaves: mrb,qq,rr
39 whose parent is 38 has 2 children named 40,sgn and groups the following leaves: sgn,mrb,qq,rr
38 whose parent is 37 has 2 children named cjk,39 and groups the following leaves: cjk,sgn,mrb,qq,rr
37 whose parent is 34 has 2 children named fxe,38 and groups the following leaves: fxe,cjk,sgn,mrb,qq,rr
34 whose parent is 29 has 2 children named 35,37 and groups the following leaves: fxf,fxe,tva,tvb,cjk,sgn,mrb,qq,rr
29 whose parent is 27 has 2 children named 30,34 and groups the following leaves: spi,sfi,eqc,fxf,fxe,kme,ame,tva,tvb,cjk,sgn,mrb,qq,rr
27 whose parent is 25 has 2 children named 28,29 and groups the following leaves: lima,limb,spi,sfi,eqc,fxf,fxe,kme,ame,tva,tvb,cjk,sgn,mrb,qq,rr
25 whose parent is 24 has 2 children named 26,27 and groups the following leaves: xa,xb,lima,limb,spi,sfi,eqc,fxf,fxe,kme,ame,tva,tvb,cjk,sgn,mrb,qq,rr
24 whose parent is 22 has 2 children named abc,25 and groups the following leaves: abc,xa,xb,lima,limb,spi,sfi,eqc,fxf,fxe,kme,ame,tva,tvb,cjk,sgn,mrb,qq,rr
22 who is the root of the tree has 2 children named 23,24 and groups the following leaves: ar,tk,abc,xa,xb,lima,limb,spi,sfi,eqc,fxf,fxe,kme,ame,tva,tvb,cjk,sgn,mrb,qq,rr

ADD COMMENT • link 13.8 years ago by jhc ★ 3.0k

Entering edit mode

13.8 years ago

DG 7.3k

No need to reinvent the wheel. If you are coding in Perl use BioPerl's Phylogenetic Tree modules to take care of the tree parsing. Then you can focus on whatever you want to do with this information. Bioperl will store the tree in an appropriate data structure, so for all nodes you can get all descendants of that node.

ADD COMMENT • link 13.8 years ago by DG 7.3k

Entering edit mode

Hey, thanks. Do you know a script compatible with BioPerl modules that would do what I want?

ADD REPLY • link 13.8 years ago by User 1940 ▴ 80

Entering edit mode

The [?]Bioperl Phylogenetics and Analysis Scrapbook[?] has a code snippet for returning all Clades in a tree that should be useful for you.

ADD REPLY • link 13.8 years ago by DG 7.3k

Entering edit mode

The Bioperl Phylogenetics and Analysis Scrapbook has a code snippet here: http://www.bioperl.org/wiki/Finding_all_clades_represented_in_a_tree that should be useful. It returns all Clades in a tree. This will only show the leaf taxa that form all possible clades but could easily be modified to show the names of internal nodes as well if desired.

ADD REPLY • link 13.8 years ago by DG 7.3k

Entering edit mode

13.8 years ago

Pierre Lindenbaum 166k

What you need is a lexer/parser like LEX/YACC/FLEX/Bison or javacc.

here is a simple implementation using javacc:


	options {
	STATIC=false;
	}

	PARSER_BEGIN(Biostar10100)

	import java.util.*;

	@SuppressWarnings("unchecked")
	public class Biostar10100
	{

	private static abstract class Node
	{
	Node parent;
	abstract void print(int depth);
	}

	private static class TerminalNode extends Node
	{
	String name;
	void print(int depth)
	{
	System.out.print("'"+name+"'");
	}

	}

	private static class Pair extends Node
	{
	Node child1;
	Node child2;
	int score;

	void print(int depth)
	{
	System.out.print("Node "+score+" includes nodes ");
	child1.print(depth+1);
	System.out.print(" and ");
	child2.print(depth+1);
	if(depth==0) System.out.println();
	if(child1.getClass()==getClass()) child1.print(0);
	if(child2.getClass()==getClass()) child2.print(0);
	}
	}

	/** just a test */
	public static void main(String args[]) throws Exception
	{
	new Biostar10100(System.in).input().print(0);
	}

	}

	PARSER_END(Biostar10100)

	SKIP :
	{
	" "
	\| "\t"
	\| "\n"
	\| "\r"
	}

	TOKEN : /* LITERALS */
	{
	<#LETTER:(["a"-"z"]\|["A"-"Z"]\|"_")>
	\| <#DIGIT:["0"-"9"]>
	\| <LPAR: "(">
	\| <RPAR: ")">
	\| <COMMA: ",">
	\| <SEMICOLON: ";">
	\| <INTEGER: (<DIGIT>)+ >
	\| <IDENTIFIER: <LETTER> (<DIGIT>\|<LETTER>)* >
	}

	private Pair input():{Pair p=null;}
	{
	p=pair() <SEMICOLON> <EOF> { return p;}
	}

	private Pair pair():{Node n1; Node n2; Token t;}
	{
	<LPAR> n1=any() <COMMA> n2=any() <RPAR> t=<INTEGER>
	{
	Pair p=new Pair();
	p.child1=n1;
	p.child2=n2;
	n1.parent=p;
	n2.parent=p;
	p.score=Integer.parseInt(t.image);
	return p;
	}
	}

	private Node any():{Node n=null;}
	{
	(n=terminal()\|n=pair()) { return n;}
	}

	private TerminalNode terminal():{Token t;TerminalNode n=new TerminalNode(); }
	{
	t=<IDENTIFIER> { n.name=t.image;return n;}
	}

view raw Biostar10100.javacc hosted with ❤ by GitHub

compile:

javacc Biostar10123.jj
javac Biostar10123.java

execute

$ echo "((ar, tk) 23 , (abc, ((xa, xb) 26 , ((lima, limb) 28 , (((spi, sfi) 31 , (eqc, (kme, ame) 33 ) 32 ) 30 , ((fxf, (tva, tvb) 36 ) 35 , (fxe, (cjk, ((mrb, (qq, rr) 41 ) 40 , sgn) 39 ) 38 ) 37 ) 34 ) 29 ) 27 ) 25 ) 24 ) 22 ;" |\
java Biostar10123

Node 22 includes nodes Node 23 includes nodes 'ar' and 'tk' and Node 24 includes nodes 'abc' and Node 25 includes nodes Node 26 includes nodes 'xa' and 'xb' and Node 27 includes nodes Node 28 includes nodes 'lima' and 'limb' and Node 29 includes nodes Node 30 includes nodes Node 31 includes nodes 'spi' and 'sfi' and Node 32 includes nodes 'eqc' and Node 33 includes nodes 'kme' and 'ame'Node 33 includes nodes 'kme' and 'ame'
Node 31 includes nodes 'spi' and 'sfi'
Node 32 includes nodes 'eqc' and Node 33 includes nodes 'kme' and 'ame'
Node 33 includes nodes 'kme' and 'ame' and Node 34 includes nodes Node 35 includes nodes 'fxf' and Node 36 includes nodes 'tva' and 'tvb'Node 36 includes nodes 'tva' and 'tvb' and Node 37 includes nodes 'fxe' and Node 38 includes nodes 'cjk' and Node 39 includes nodes Node 40 includes nodes 'mrb' and Node 41 includes nodes 'qq' and 'rr'Node 41 includes nodes 'qq' and 'rr' and 'sgn'Node 40 includes nodes 'mrb' and Node 41 includes nodes 'qq' and 'rr'
Node 41 includes nodes 'qq' and 'rr'
Node 39 includes nodes Node 40 includes nodes 'mrb' and Node 41 includes nodes 'qq' and 'rr'Node 41 includes nodes 'qq' and 'rr'
(...)

ADD COMMENT • link 9.9 years ago by Pierre Lindenbaum 166k

Entering edit mode

That's true, writing the parsers can be fun and a good learning experience I won't deny that :) But as a personal philosophy I think we should be supporting the open Bio coding communities as much as possible.

ADD REPLY • link 13.8 years ago by DG 7.3k

Entering edit mode

"Why reinvent the wheel by writing a new parser?" Writing a parser is FUN + NO dependencies + did I say that writing a parser is FUN ? :-)

ADD REPLY • link 13.8 years ago by Pierre Lindenbaum 166k

Entering edit mode

Thanks Pierre. I'm downloading Netbeans now to give a try with this.

ADD REPLY • link 13.8 years ago by User 1940 ▴ 80

Entering edit mode

Thanks Pierre but I could not run it. Can you give me a simple instruction for how to use this implementation?

ADD REPLY • link 13.8 years ago by User 1940 ▴ 80

Entering edit mode

Perhaps you have to do a "apt-get install javacc" or download and install javacc before running it (it works for me).

By the way, thanks for this post, I will dig a bit on javacc, it will be useful for me.

ADD REPLY • link 13.8 years ago by Fred ▴ 790

Entering edit mode

I downloaded javacc from its website, gunzipped and then tarred. I saved the script on the top as Biostar10123.java, then runned javac Biostar10123.java - it returns 23 errors. And when I try to execute, I get the following error:

Exception in thread "main" java.lang.NoClassDefFoundError: Biostar10123 Caused by: java.lang.ClassNotFoundException: Biostar10123 at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.jav

ADD REPLY • link 13.8 years ago by User 1940 ▴ 80

Entering edit mode

No the "script" should be called "Biostar10123.jj" not " Biostar10123.java". see my "Compile" paragraph

ADD REPLY • link 13.8 years ago by Pierre Lindenbaum 166k

Entering edit mode

Why reinvent the wheel by writing a new parser? The tree parsers for Bioperl, Biopython, and Biojava should all do the job and do it well, and can handle any sort of valid newick format tree (and many also support nexus and phyloxml tree files as well. Makes any resulting code much more reusable and will work on non-strictly bifurcating trees as well.

ADD REPLY • link 13.8 years ago by DG 7.3k

Entering edit mode

Unless, like me, you want to work in Objective C. I would kill for a nice newick parser (or a consistent and speedy algorithm for parsing across OpenBio). I'm currently messing around with ParseKit, though (http://www.parsekit.com).

ADD REPLY • link 13.5 years ago by Chris F. ▴ 20