Dear community,
I ran PhyML on a gene family to build a tree. Looking at the results, I'm a bit worried about the log-likelihood value: it's -754, which means the likelihood is almost zero! Does this mean that the program has little confidence in the estimated parameters or the tree topology? Or am I misunderstanding what this number means?
Thank you so much!!
oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
--- PhyML 3.3.20190909 ---
http://www.atgc-montpellier.fr/phyml
Copyright CNRS - Universite Montpellier
oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
. Sequence filename: exon3_wb_aligned_phy
. Data set: #1
. Initial tree: BioNJ
. Model of nucleotides substitution: GTR
. Number of taxa: 52
. Log-likelihood: -754.13849
. Unconstrained log-likelihood: -348.41766
. Composite log-likelihood: -6438.01266
. Parsimony: 119
. Tree size: 1.35199
. Discrete gamma model: Yes
- Number of classes: 4
- Gamma shape parameter: 1.901
- Relative rate in class 1: 0.28116 [freq=0.250000]
- Relative rate in class 2: 0.64406 [freq=0.250000]
- Relative rate in class 3: 1.06730 [freq=0.250000]
- Relative rate in class 4: 2.00748 [freq=0.250000]
. Nucleotides frequencies:
- f(A)= 0.37232
- f(C)= 0.24092
- f(G)= 0.17327
- f(T)= 0.21350
. GTR relative rate parameters :
A <-> C 0.82212
A <-> G 1.82689
A <-> T 0.53724
C <-> G 0.17829
C <-> T 2.00016
G <-> T 1.00000
. Instantaneous rate matrix :
[A---------C---------G---------T------]
-0.82453 0.25951 0.41474 0.15028
0.40104 -1.00102 0.04048 0.55950
0.89119 0.05628 -1.22720 0.27973
0.26208 0.63136 0.22702 -1.12046
. Run ID: none
. Random seed: 1625516914
. Subtree patterns aliasing: no
. Version: 3.3.20190909
. Time used: 0h0m4s (4 seconds)
oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
Suggested citations:
S. Guindon, JF. Dufayard, V. Lefort, M. Anisimova, W. Hordijk, O. Gascuel
"New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0."
Systematic Biology. 2010. 59(3):307-321.
S. Guindon & O. Gascuel
"A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood"
Systematic Biology. 2003. 52(5):696-704.
oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
oooooooooooooooooooooooooooooooo
Thanks for the reply! May I follow up with a question? Since the LL value is dataset dependent, for a particular dataset, how do I know that my LL is good enough?
Thanks again!!
As with any heuristic search or sampling method that covers only part of the space of possible trees, you can never be sure that your LL is the best it can be. After all, there is no target number that is known ahead of time.
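To make the "no target number" point concrete, here is a minimal sketch of why the absolute size of the LL says little by itself. It uses the -754.13849 reported in the PhyML output; the per-site value of -2.5 is a made-up number for illustration only, not something taken from this dataset.

```python
import math

log_lik = -754.13849  # value reported in the PhyML output above

# Converting back to a raw likelihood underflows in double precision,
# which is one reason programs report the logarithm instead.
raw = math.exp(log_lik)

# The same quantity expressed as a power of ten.
log10_lik = log_lik / math.log(10)

def total_log_lik(per_site_log_liks):
    """Sum per-site log-likelihoods (sites are treated as independent)."""
    return sum(per_site_log_liks)

# Hypothetical per-site log-likelihood, identical at every site.
short_alignment = total_log_lik([-2.5] * 100)
long_alignment = total_log_lik([-2.5] * 1000)

print(raw, log10_lik, short_alignment, long_alignment)
```

The total is a sum over sites, so a longer alignment gives a more negative LL even when the fit per site is the same. That is why LL values are only comparable between analyses of the same alignment.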
One way around it is to run bootstrap replicates: resample the alignment columns with replacement, rebuild the tree for each replicate (at least 100, and 1000 is even better), and calculate bootstrap support for each branch as the percentage of replicate trees that contain it. Opinions vary, but most people would probably agree that branches with bootstrap support above 70-80% are reliable.
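In practice the tree-building program does the bootstrapping for you (in PhyML, as far as I recall, via the `-b` option followed by the number of replicates). Just to show what happens under the hood, here is a rough sketch. The function names and the representation of a branch as a frozenset of taxa are my own choices for illustration, and the tree reconstruction step itself is left out.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Build one bootstrap replicate by resampling columns with replacement.

    `alignment` maps taxon name -> aligned sequence (all of equal length).
    """
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in cols)
            for name, seq in alignment.items()}

def branch_support(reference_splits, replicate_split_sets):
    """Percentage of replicate trees that contain each reference split.

    A split is a frozenset of the taxa on one side of a branch. This
    assumes every split is stored in the same canonical orientation,
    e.g. always the side that excludes one fixed reference taxon.
    """
    n = len(replicate_split_sets)
    return {s: 100.0 * sum(s in rep for rep in replicate_split_sets) / n
            for s in reference_splits}

# Toy usage: a branch found in 3 of 4 replicate trees gets 75% support.
ab = frozenset({"a", "b"})
print(branch_support([ab], [{ab}, set(), {ab}, {ab}]))
```

Each replicate alignment would be fed to the same tree search as the original data, and the splits of the resulting trees collected into `replicate_split_sets`.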
Another way is to do a Bayesian analysis, which typically runs at least two independent tree reconstructions for a large number of sampling generations (at least a million, but more is better). If they independently converge to similar LL values, that supports the idea that the resulting tree is close to a global maximum of the LL. A quantity called the standard deviation of split frequencies (SDSF) tells you how well the independent runs match each other. SDSF approaches 0 as the sets of sampled trees become identical, but for practical purposes SDSF < 0.01 is accepted as a sign of convergence. Independently, Bayesian methods assign a posterior probability (on a 0-1 scale) to each branch, which plays a role comparable to bootstrap support in ML methods.
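For the curious, SDSF can be sketched roughly as follows. This is my own simplified version, not the code of any particular program: the 0.10 minimum-frequency cutoff and the use of the sample standard deviation are assumptions, and a split is again represented as a frozenset of taxa.

```python
from statistics import mean, stdev

def split_frequencies(sampled_trees):
    """Fraction of sampled trees in which each split appears.

    `sampled_trees` is a list with one set of splits per sampled tree.
    """
    counts = {}
    for splits in sampled_trees:
        for s in splits:
            counts[s] = counts.get(s, 0) + 1
    return {s: c / len(sampled_trees) for s, c in counts.items()}

def average_sdsf(runs, min_freq=0.10):
    """Average over splits of the std. dev. of their frequencies across runs.

    Needs at least two runs. Splits rarer than `min_freq` in every run
    are ignored (the cutoff value is an assumption).
    """
    freqs = [split_frequencies(run) for run in runs]
    all_splits = set().union(*freqs)
    sds = []
    for s in all_splits:
        values = [f.get(s, 0.0) for f in freqs]
        if max(values) >= min_freq:
            sds.append(stdev(values))
    return mean(sds) if sds else 0.0

# Toy usage: two runs that sampled the same splits equally often.
AB, CD = frozenset({"A", "B"}), frozenset({"C", "D"})
run1 = [{AB}, {AB}, {AB, CD}, {AB}]
run2 = [{AB}, {AB}, {AB, CD}, {AB}]
print(average_sdsf([run1, run2]))
```

Two runs that agree on how often every split occurs give a value of 0, and the value grows as the runs disagree about which branches they keep sampling.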