OK, I haven't worked it into my other code just yet, but here are some approaches you could use, based around ete3:
from ete3 import Tree
import sys
from statistics import median

# Read the first tree (one Newick string) from the file given on the command line.
with open(sys.argv[1], 'r') as handle:
    t = Tree(handle.readline())

nodes = [node for node in t.traverse()]

# Get all branch lengths:
print('Tree = {}'.format(str(sys.argv[1])))
print('br_lens')
for node in nodes:
    print(node.dist)
print('AVERAGE: {}'.format(float(sum([node.dist for node in nodes])/len(nodes))))
print('MEDIAN: {}'.format(median([node.dist for node in nodes])))

# Support is basically a case of doing the same as the above.
print('\n')
print('node_support')
for node in nodes:
    print(node.support)
print('AVERAGE: {}'.format(float(sum([node.support for node in nodes])/len(nodes))))
print('MEDIAN: {}'.format(median([node.support for node in nodes])))
Given the input as bs.tree:
$ cat bs.tree
((logi|XP_009052348.1:0.30900,(dapu|EFX67985.1:0.40918,dare|NP_001007771.1:0.18921)0.580:0.08422)0.826:0.09733,cate|ELT98251.1:0.29370,(lian|XP_013420576.1:0.18354,((ocbi|XP_014783723.1:0.22136,(neve|XP_001634838.1:1.09355,(scma|XP_018652019.1:0.58808,ecmu|CDS40328.1:0.89059)0.920:0.47871)0.738:0.11872)0.572:0.03167,hero|XP_009022332.1:0.79732)0.005:0.06582)0.790:0.07221);
$ python3 script.py bs.tree
Tree = bs.tree
br_lens
0.0
0.09733
0.2937
0.07221
0.309
0.08422
0.18354
0.06582
0.40918
0.18921
0.03167
0.79732
0.22136
0.11872
1.09355
0.47871
0.58808
0.89059
AVERAGE: 0.3291227777777778
MEDIAN: 0.205285
node_support
1.0
0.826
1.0
0.79
1.0
0.58
1.0
0.005
1.0
1.0
0.572
1.0
1.0
0.738
1.0
0.92
1.0
1.0
AVERAGE: 0.8572777777777777
MEDIAN: 1.0
It's not the most elegant code in the world (it could probably be refactored to a function rather than loads of printing and list comprehensions) but hopefully that's close enough to what you need to suffice.
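As a rough idea of what that refactoring could look like, here is a minimal sketch. The function names `summarise_nodes` and `summarise_file` are made up for illustration. Unlike the script above, it leaves the root out of the branch lengths and only counts support on internal nodes, on the assumption that the 0.0 root length and the 1.0 support on tips seen in the output above are defaults rather than values from the file, so the numbers will differ from those printed above.

```python
from statistics import mean, median


def summarise_nodes(nodes):
    """Summarise branch lengths and supports for an iterable of tree nodes.

    Each node is expected to expose .dist, .support, .is_leaf() and
    .is_root(), as ete3 nodes do.
    """
    nodes = list(nodes)
    # Assumption: the root has no parent branch, so its length is skipped.
    lengths = [n.dist for n in nodes if not n.is_root()]
    # Assumption: only internal, non-root nodes carry a support from the file.
    supports = [n.support for n in nodes
                if not n.is_leaf() and not n.is_root()]
    return {
        "mean_length": mean(lengths),
        "median_length": median(lengths),
        "mean_support": mean(supports),
        "median_support": median(supports),
    }


def summarise_file(path):
    """Load the first tree in a Newick file with ete3 and summarise it."""
    from ete3 import Tree  # imported here so summarise_nodes runs without ete3

    with open(path) as handle:
        tree = Tree(handle.readline())
    return summarise_nodes(tree.traverse())
```

Calling `summarise_file("bs.tree")` would then return a dictionary you can print or write out however you like. Note that `mean` and `median` raise an error on an empty list, so a tree without any internal support values would need handling separately.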
If you want to apply it to lots of trees I'd suggest doing something like:
$ for file in *.tree ; do python3 script.py "$file" > "${file%.*}"_output.txt ; done
(or look into parallel processing with GNU parallel or similar).
How do you need the output? As labels within a plotted tree? You might want to consider that Newick format is not ideal for this, because there is no guarantee that the values you need are even in there; it all depends on the software writing the output. If you could provide a small example it might help, but the format is rather easy to parse.
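To give a feel for how little parsing is needed when all trees come from the same tool, here is a minimal regex-based sketch. It assumes unquoted labels, non-negative branch lengths after a colon, and support values written directly after a closing bracket; the function name `parse_newick_values` is made up for illustration. Anything outside those assumptions (quoted labels, named internal nodes, comments in square brackets) is better handled by a proper tree library.

```python
import re

# A non-negative number such as 0.30900, 1 or 2.5e-05.
NUMBER = r"\d+(?:\.\d+)?(?:[eE][+-]?\d+)?"
# Branch lengths follow a colon; supports sit directly after a closing bracket.
LENGTH_RE = re.compile(r":(" + NUMBER + r")")
SUPPORT_RE = re.compile(r"\)(" + NUMBER + r")")


def parse_newick_values(newick):
    """Return (branch_lengths, supports) as two lists of floats."""
    lengths = [float(x) for x in LENGTH_RE.findall(newick)]
    supports = [float(x) for x in SUPPORT_RE.findall(newick)]
    return lengths, supports


lengths, supports = parse_newick_values("((A:0.1,B:0.2)0.9:0.3,C:0.4);")
print(lengths)   # [0.1, 0.2, 0.3, 0.4]
print(supports)  # [0.9]
```

The values come back in the order they appear in the string, and they are not tied to particular nodes, which is fine for per-tree averages but not for anything that depends on the topology.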
Michael, thanks for your response. The need is just a text output for each tree: all branch lengths (or their average) and all node supports (or their average). The values need not be in one file. Extracting these values from trees is the step that would require ad hoc scripting if no nice pipeline-friendly software is available. I do not have an example tree at hand at the moment, but these are standard Newick trees output by FastTree.
If you need the branch lengths, they're already encoded inside the newick format - what do you want to do with the values?
Here's some code I wrote a while back to work out distances in trees:
https://github.com/jrjhealey/bioinfo-tools/blob/master/tree_dists.py
You can use it like so, to get the pairwise distances between all tips:
python tree_dists.py -m all -s newick -i mytreefile.tree
You can also use it like so, to get the distance between the 2 most distant tips:
python tree_dists.py -m max -s newick -i mytreefile.tree
The max is default, so you can also run this without -m to get the same result. If it's useful, I'll be happy to edit the code to alter the output formats or to provide other calculation options.
Dear jrj.healey, my need in this case is rather simple: basically to parse Newick and extract the numerical values of branch lengths and node support. I will then do basic statistics to assess the trees and bin them further. This is a constitutive step of our phylogenomic approach to analysing orthology groups. Have you considered making your script extract such values?
I can look into it. It shouldn't be difficult. ETE3's object model stores nodes with their associated values, I believe.
Could you mock up some example input and output you'd expect?
Ok, example in-outs are like this.
Input would be a standard newick:
Output would be plain text columns:
And so on for each of n (many thousands) trees. If it all goes to one or separate files – whatever is easier to implement. Please let me know if I made sense.