Question

Is that possible to use Neighbor Joining algorithm for a distance matrix with more than 1M taxa

0

Entering edit mode

9.3 years ago

ajingnk ▴ 130

I have a distance matrix with more than 1M nodes. I wonder whether there exist any Neighbour Joining implementation which could calculate a tree structure within a reasonable amount of time. The complexity for Neighbour Joining is N^3, so I am not sure whether there is any heuristic algorithm could do this.

Currently, I have a distance matrix. If there is another comparable tree building algorithm starting from distance matrix, I would like to try that method also.

phylo tree • 5.0k views

ADD COMMENT • link updated 7.8 years ago by Biostar 20 • written 9.3 years ago by ajingnk ▴ 130

1

Entering edit mode

1M*1M=1TB. If you have a matrix in float, that is 4TB. An algorithm are likely to keep a few such matrices. Do you have a machine with ~10TB RAM? If your main purpose is clustering, there are more appropriate algorithms than NJ.

ADD REPLY • link 9.3 years ago by lh3 33k

0

Entering edit mode

There are in fact efficient algorithms that do not require having the entire matrix loaded into memory.

ADD REPLY • link 9.3 years ago by Lars Juhl Jensen 11k

0

Entering edit mode

I haven't seen a popular implementation of hierarchical clustering with complexity less than N^2

ADD REPLY • link 9.3 years ago by ajingnk ▴ 130

0

Entering edit mode

What will you do next after you get the tree? If you are to cluster nodes into groups, you can do clustering directly without a tree. There are many clustering algorithms. Hierarchical clustering is just a very basic one.

ADD REPLY • link 9.3 years ago by lh3 33k

0

Entering edit mode

I need the tree structure for visualization.

ADD REPLY • link 9.3 years ago by ajingnk ▴ 130

0

Entering edit mode

Suppose you managed to do the clustering. How do you imagine showing the hierarchical structure of a tree with 1M leaf nodes? It sounds to me like your actual problem is much more about big-data visualization than it is about the computational challenge.

ADD REPLY • link 9.3 years ago by Lars Juhl Jensen 11k

0

Entering edit mode

A tree with 1M leaf nodes is about 2M nodes and 2M edges. I was thinking to use Gephi.. But, I guess I should reduce the size of the data first.

ADD REPLY • link 9.3 years ago by ajingnk ▴ 130

0

Entering edit mode

If the starting point is an all-against-all similarity matrix, it is theoretically impossible to do better than N^2, since just reading through the similarity matrix is N^2.

ADD REPLY • link 9.3 years ago by Lars Juhl Jensen 11k

0

Entering edit mode

Yes, you are right. N^2 is the best the algorithm can do.

ADD REPLY • link 9.3 years ago by ajingnk ▴ 130

Ram · Answer 1 · 2015-08-22

1

Entering edit mode

9.3 years ago

Lars Juhl Jensen 11k

It is pushing the limits but likely possible. Although the worst-case performance of Neighbour Joining is N^3, there are heuristics that evolve N^2 on most actual data. I would take a look at RapidNJ and related work by the same group. NINJA might be a viable option too. Extrapolating based on N^2 scaling and a runtime of 100 seconds for 8000 taxa, you are looking at runtime in the order of a few weeks.

ADD COMMENT • link updated 5.0 years ago by Ram 44k • written 9.3 years ago by Lars Juhl Jensen 11k

0

Entering edit mode

Thanks Lars! So, RapidNJ and NINJA take care of the memory issue also?

ADD REPLY • link 9.3 years ago by ajingnk ▴ 130

0

Entering edit mode

I am not sure. I think eRapidNJ (the paper I linked to) is where it was introduced, but I have not found the code. If RapidNJ takes too much memory, I would suggest you contact the authors of the eRapidNJ article.

ADD REPLY • link 9.3 years ago by Lars Juhl Jensen 11k

0

Entering edit mode

The paper said that eRapidNJ still loads "the lower triangular matrix" into RAM. That is 2TB.

ADD REPLY • link 9.3 years ago by lh3 33k

0

Entering edit mode

The authors state that *first* they reduce the memory requirement by a factor of two by storing only the lower triangular matrix. On the next page they explain how they further reduce the memory consumption by not storing only some columns in memory at a time.

ADD REPLY • link 9.3 years ago by Lars Juhl Jensen 11k

0

Entering edit mode

My impression is the column-only trick is only applied to S, not the distance matrix D? "The size of S is reduced by only storing as many columns of S as can fit in the available internal memory after D has been loaded."

ADD REPLY • link 9.3 years ago by lh3 33k

0

Entering edit mode

You may be right. However, one thing really mystifies me: in the abstract the authors state that the N^2 memory consumption becomes a problem for large trees, and that the new algorithm is applicable to very large trees. That does not make much sense if the new algorithm also requires N^2 memory, does it?

ADD REPLY • link 9.3 years ago by Lars Juhl Jensen 11k