I was wondering whether the following problem has been considered in the literature. Let T be a typical, binary, leaf-labelled phylogenetic tree, with leaves x1, x2, ..., xn, and suppose I receive a new sequence y that is supposed to be related to the sequences I already have. Can one determine whether y can be inserted in that tree, either as an ancestor of my available sequences or as a new leaf, and if multiple choices are possible, what is (are) the most likely option(s)?
I agree with those above that re-evaluating the tree sometimes gives better results, but it might not be the best approach when there are many sequences to add to a big tree. Have a look at pplacer. It describes itself like this:
"Pplacer places query sequences on a fixed reference phylogenetic tree to maximize phylogenetic likelihood or posterior probability according to a reference alignment. Pplacer is designed to be fast, to give useful information about uncertainty, and to offer advanced visualization and downstream analysis"
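To make the idea of "placing a query on a fixed tree" concrete, here is a minimal Python sketch that only enumerates the candidate attachment points for a new leaf on a rooted binary tree. It is not pplacer's algorithm and it does no likelihood scoring. The nested-tuple tree representation and the names placements and to_newick are my own assumptions for illustration.

```python
def placements(tree, new_leaf):
    """Yield every rooted binary tree obtained by attaching new_leaf
    as the sibling of one subtree of `tree` (a leaf name or a 2-tuple)."""
    # Attach on the edge leading into this subtree (or above the root).
    yield (tree, new_leaf)
    if isinstance(tree, tuple):
        left, right = tree
        # Recurse: attach somewhere inside the left or the right child.
        for alt in placements(left, new_leaf):
            yield (alt, right)
        for alt in placements(right, new_leaf):
            yield (left, alt)


def to_newick(tree):
    """Serialise a nested-tuple tree as a Newick string without branch lengths."""
    if isinstance(tree, tuple):
        return "(" + ",".join(to_newick(child) for child in tree) + ")"
    return tree


# Hypothetical reference tree with four leaves and one query sequence "y".
reference = (("x1", "x2"), ("x3", "x4"))
candidates = list(placements(reference, "y"))

for candidate in candidates:
    print(to_newick(candidate) + ";")
```

A rooted binary tree with n leaves has 2n - 1 nodes, so there are 2n - 1 candidate placements. A placement tool then has to score each candidate (for example by likelihood) rather than search over all topologies, which is why this scales better than rebuilding a big tree.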
There are times when we want to keep the topology. You can:
Add the new sequence to the existing multialignment with muscle, or any aligner that supports profile alignment.
Build the tree with "treebest nj -c old_tree.nh new_alignment.fasta". The resulting tree always retains the topology of the input tree. This is "hard" constraining.
Alternatively, you can build the tree with "treebest phyml -C old_tree.nh new_alignment.fasta". In this case, however, the topology of the old tree might change if the alignment strongly supports an alternative topology. This is "soft" constraining.
"nj" does constrained neighbour joining, which is described in a PhD thesis. "phyml" is modified from an old version of the PhyML program; it penalizes bisections that disagree with the input topology. No detailed description of that algorithm is available.
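If you want to check afterwards whether a constrained run really kept the old topology, one option is to prune the new sequence from the new tree and compare what is left with the old tree. The Python sketch below does that for rooted binary trees written as nested tuples. The representation and the function names are assumptions of mine, it is not part of treebest, and it ignores branch lengths.

```python
def prune(tree, leaf):
    """Remove `leaf` from a nested-tuple tree, collapsing the node
    that is left with a single child. Returns None if nothing remains."""
    if not isinstance(tree, tuple):
        return None if tree == leaf else tree
    kept = [sub for sub in (prune(child, leaf) for child in tree) if sub is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]
    return tuple(kept)


def canonical(tree):
    """Order-independent form of a tree, so that (a, b) equals (b, a)."""
    if isinstance(tree, tuple):
        return frozenset(canonical(child) for child in tree)
    return tree


def keeps_topology(old_tree, new_tree, new_leaf):
    """True if new_tree, with new_leaf removed, has the same rooted topology as old_tree."""
    return canonical(prune(new_tree, new_leaf)) == canonical(old_tree)


# Hypothetical trees: `hard` only adds y, `soft` also rearranges x2 and x3.
old = (("x1", "x2"), ("x3", "x4"))
hard = (("x1", ("x2", "y")), ("x4", "x3"))
soft = ((("x1", "x3"), "y"), ("x2", "x4"))

print(keeps_topology(old, hard, "y"))
print(keeps_topology(old, soft, "y"))
```

With "hard" constraining the check should always pass; with "soft" constraining a failure tells you the alignment pulled the tree towards a different topology.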
I do not know what the literature says regarding this scenario, but I know from practice (doing this myself and talking to other scientists who have either written phylogenetic analysis software or done this type of analysis) that adding a sequence or taxon to a tree is best done by re-evaluating all relationships. In other words, you should recalculate the tree, because the new sequence can alter many relationships between pairs of "old" sequences, as well as having its own relationships to those "old" sequences and to the ancestral sequences at each node.
PAGAN will accept new sequences (reads) to be added to an existing alignment (ref_alignment) and associated tree (ref_tree). The tree can be labeled in NHX format for a subset of nodes to try, or use the slower --exhaustive option.
Just for completeness, I would also suggest MLTreeMap, which seems to be very similar to pplacer.
+1 Sounds great!
Great, thanks a lot!