How to convert PHYLIP to NEWICK format (pairwise distance matrix)
2
0
Entering edit mode
21 months ago
Linda • 0

Hi, I'm stuck on how to convert a pairwise-distance matrix or phylim format into newick format.

Newick format is the required input for plotting a tree for many programs yet there seems to me no straightforward method on how to build a newick file.

phylip format

I'm gravitating towards using the Bio Phylo package however it cannot read phylim format so therefore can't convert it into newick, nexus, phyloxml etc.

There is an answer here but I'm not sure I can verify it works https://stackoverflow.com/questions/9364609/converting-ndarray-generated-by-hcluster-into-a-newick-string-for-use-with-ete2/17657426#17657426

There are previous questions about this on here which remain unanswered - here and here

I have tried using tools such as FastMe however it cannot handle a large matrix and is very slow. I am willing to alter the distance matrix format as necessary.

If anyone has any recommendations I would greatly appreciate it! Thanks!

python phylogeny • 2.1k views
ADD COMMENT
1
Entering edit mode

It may be useful to add "pair-wise distance matrix" to title itself since your question is specifically about that.

ADD REPLY
0
Entering edit mode

Where are you getting the matrix from? Can the tool that gave it you not create a tree? It's unusual to only spit out the matrix.

If you can get the distance matrix into the right format, its possible you could 'hack this is' to the Bio.Phylo.TreeConstruction module and 'trick' it in to thinking it created the object in the first place so you can apply all the other methods it has available.

You haven't specified what kind of tree you intend to derive from this table though - NJ, UPGMA, ML etc?

ADD REPLY
1
Entering edit mode
21 months ago
Joe 21k

I'm afraid I'm too lazy to write out by hand your entire matrix (please provide text rather than screenshots for data and code), but here's how you'd 'hack' it in to BioPython. If you can provide the raw data, it's not that difficult to parse as a plain file and convert it in to lists as needed below.

(The code in question is mostly here: https://github.com/biopython/biopython/blob/master/Bio/Phylo/TreeConstruction.py#L314)

from Bio import Phylo
from Bio.Phylo.TreeConstruction import DistanceTreeConstructor, DistanceMatrix

names = ['Aurora', 'Boylii', 'Cascadae']
values = [[0], [0.1, 0], [0.13, 0.7, 0]] #extend these lists as required for the whole matrix.

dm = DistanceMatrix(names=names, matrix=values)
constructor = DistanceTreeConstructor()

njtree = constructor.nj(dm) #change this line if you want something other than neighbour joining
print(njtree)

# Output:
# Tree(rooted=False)
#     Clade(branch_length=0, name='Inner1')
#         Clade(branch_length=-0.23499999999999996, name='Aurora')
#         Clade(branch_length=0.33499999999999996, name='Boylii')
#         Clade(branch_length=0.365, name='Cascadae')

You can then satisfy yourself that this is a tree:

Phylo.draw_ascii(njtree)
 , Aurora
 |
_|______________________________________________________________ Boylii
 |
 |____________________________________________________________________ Cascadae

I couldn't figure out a more elegant way to do this, though I'm sure there is one, but you can export this as a newick string like so:

print(njtree.format("newick"))
(Aurora:-0.23500,Boylii:0.33500,Cascadae:0.36500)Inner1:0.00000;
ADD COMMENT
1
Entering edit mode
21 months ago
Joe 21k

Just for fun I had ChatGPT try this too. Here's what it came up with.

A few notable differences:

  • it uses numpy
  • it is using a full matrix rather than triangular half matrix as biopython does
  • It only produced a cladogram, so there are no branch lengths (I did tell the AI this, and it can produce updated code to change that) YMMV.

ChatGPT source:

import numpy as np

def create_newick_tree(dist_matrix, labels):
    """Create a Newick format tree from a distance matrix and labels."""
    num_taxa = len(labels)

    # Initialize the node labels
    node_labels = labels.copy()

    # Initialize the distance matrix
    dist = dist_matrix.copy()

    # Create the initial set of nodes
    nodes = [i for i in range(num_taxa)]

    # Keep track of the next available node ID
    next_node_id = num_taxa

    # Create the tree
    while len(nodes) > 1:
        # Find the two closest nodes
        min_dist = np.min(dist[np.triu_indices(num_taxa, k=1)])
        min_i, min_j = np.where(dist == min_dist)
        min_i, min_j = min_i[0], min_j[0]

        # Create a new node to represent the cluster
        new_node_id = next_node_id
        next_node_id += 1

        # Update the node labels and distances
        node_labels[new_node_id] = "(" + node_labels[min_i] + "," + node_labels[min_j] + ")"
        for k in range(num_taxa):
            if k != min_i and k != min_j:
                dist[min_i,k] = (dist[min_i,k] + dist[min_j,k]) / 2.0
                dist[k,min_i] = dist[min_i,k]
        dist = np.delete(dist, min_j, axis=0)
        dist = np.delete(dist, min_j, axis=1)
        node_labels = np.delete(node_labels, min_j)

        # Add the new node to the list of nodes
        nodes.append(new_node_id)

    # Return the Newick format string
    return node_labels[0] + ";"

# Example usage
dist_matrix = np.array([[0, 5, 9, 9], [5, 0, 10, 10], [9, 10, 0, 8], [9, 10, 8, 0]])
labels = np.array(["A", "B", "C", "D"])
newick_tree = create_newick_tree(dist_matrix, labels)
print(newick_tree)

It claims the result will be:

((A,B),(C,D));

DISCLAIMER I haven't confirmed that this is correct at anything other than a glance. Just include it for 'fun'.

I did challenge it to re-implement this task using biopython and it give a very similar solution to mine above.

ADD COMMENT

Login before adding your answer.

Traffic: 2911 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6