Interpreting a phylogenetic tree
4 months ago
davidmaimoun ▴ 50

Hello everyone,

I am new to phylogenetics. I built a tree based on SNPs, using Parsnp to make the alignment and generate the tree. I rendered the final tree with iTOL:

[image: the tree as rendered in iTOL]

but I don't know what the numbers on the branches represent. They seem to be a ratio between two values. I'd be glad to get some help.

Thank you

phylogenetic-tree itol phylogenetics • 2.0k views

Also, it may make more sense to display node bootstrap support values in a dendrogram if they are generated by the tool.


You have this tree laid out in what is commonly referred to as a "cladogram" view which, despite displaying the branch length values, does not draw the branches in proportion to them. If you switch to a classic dendrogram view, you'll be able to see visually what the numbers are doing (but you will still need to read up on what they mean and how they're derived).


I understand now! Thank you for this valuable help!

4 months ago
Mark ★ 1.6k

I believe Parsnp runs FastTree under the hood to generate its trees, so those are (approximately) maximum likelihood (ML) branch lengths. See this link for how to interpret branch lengths on an ML tree:

What does mean branch length of Maximum likelihood tree?
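As a rough illustration, assuming the branch lengths are in the usual ML unit of expected substitutions per site, a length can be turned into an expected number of changes by multiplying by the number of sites. The helper name and the alignment length of 10,000 sites below are made up for the example and are not taken from the tree in the question.

```python
def expected_substitutions(branch_length, n_sites):
    """Expected number of substitutions along one branch.

    Assumes branch_length is in expected substitutions per site and
    n_sites is the number of alignment columns used to build the tree.
    """
    return branch_length * n_sites


# Hypothetical alignment length; replace with the real number of sites.
N_SITES = 10_000

for length in (0.0, 0.003, 0.006):
    changes = expected_substitutions(length, N_SITES)
    print(f"branch length {length}: about {changes:.0f} substitutions")
```

Whether this conversion is meaningful for a SNP-only alignment depends on how the tool scales its branch lengths, so it is worth checking the Parsnp documentation before reading too much into the absolute values.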

4 months ago

Those numbers are branch lengths. There are different ways to calculate branch lengths, and I'll leave it to you to read up on how phylogenetic trees are constructed, but basically each number represents how different two "sequences" are. I put quotation marks around "sequences" to highlight that these are not necessarily the sequences you fed into the software as FASTA files; they might be computer-generated sequences produced by the software that did the alignment.

For example, the top two branches of your tree show that sequences NM11-7.fasta.ref and NM126.fasta both have a branch length of 0 relative to the "root node" on the left side of the tree. This probably means that both sequences are very similar to a consensus sequence representing all the sequences in your tree, and I would guess that they are identical to each other. I'm guessing because I haven't used the software you mentioned, and you haven't provided any details about the commands, options, or settings you used when producing the alignment or the tree.

Looking at the bottom of your tree, the sequences NM162.fasta and NM214.fasta show branch lengths of 0.003 and 0.006, indicating that one is more similar than the other to another computer-generated sequence, represented by the node they both connect to. That same node has a branch length of 0.003, showing how similar it is to yet another computer-generated sequence representing the next node over.
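To make the arithmetic concrete, here is a minimal Python sketch that reads branch lengths from a Newick string and adds them up along the path between two tips. The Newick string is a hypothetical fragment built from the values mentioned above (0, 0.003, 0.006); the real topology of the tree in the question may differ. `parse_newick` and `tip_distance` are illustrative helpers, not part of Parsnp or iTOL.

```python
import re


def parse_newick(newick):
    """Return {node: (parent, branch_length)} for a simple Newick string.

    Minimal sketch: handles nested parentheses, tip names and ':length'.
    Internal nodes get generated names (node1, node2, ...); quoted labels
    and comments are not supported, and internal labels are ignored.
    """
    edges = {}
    stack = []        # internal nodes whose closing ")" has not been seen
    current = None    # node that a following ':length' belongs to
    n_internal = 0
    for token in re.findall(r"[(),;]|[^(),;]+", newick):
        token = token.strip()
        if not token:
            continue
        if token == "(":
            n_internal += 1
            name = f"node{n_internal}"
            if stack:
                edges[name] = (stack[-1], 0.0)
            stack.append(name)
            current = None
        elif token == ")":
            current = stack.pop()
        elif token == ",":
            current = None
        elif token == ";":
            break
        else:
            label, _, length = token.partition(":")
            if current is None:  # a new tip
                current = label
                edges[current] = (stack[-1], 0.0)
            if length and current in edges:
                edges[current] = (edges[current][0], float(length))
    return edges


def path_to_root(edges, node):
    """Map each node on the way up to the root to its branch length."""
    steps = {}
    while node in edges:
        parent, length = edges[node]
        steps[node] = length
        node = parent
    return steps


def tip_distance(edges, tip_a, tip_b):
    """Sum of branch lengths on the path between two tips."""
    up_a = path_to_root(edges, tip_a)
    up_b = path_to_root(edges, tip_b)
    # Branches on both paths lie above the common ancestor, so skip them.
    unshared = set(up_a) ^ set(up_b)
    return sum(up_a.get(n, 0.0) + up_b.get(n, 0.0) for n in unshared)


# Hypothetical fragment, not the actual tree from the question.
tree = ("((NM162.fasta:0.003,NM214.fasta:0.006):0.003,"
        "NM11-7.fasta.ref:0,NM126.fasta:0);")
edges = parse_newick(tree)
for pair in [("NM162.fasta", "NM214.fasta"),
             ("NM162.fasta", "NM126.fasta"),
             ("NM11-7.fasta.ref", "NM126.fasta")]:
    print(pair, round(tip_distance(edges, *pair), 3))
```

The distance between two tips is the sum of the branches separating them, so in this toy tree NM162.fasta and NM214.fasta are 0.003 + 0.006 apart, while the two zero-length tips are indistinguishable from each other.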


Hi, could you help me understand

"similar than the other to another computer-generated sequence"

from your comment?

Thank you


OK, so in the picture below I've labeled two nodes, A and B. You can think of these nodes as theoretical, computer-generated sequences that may be used during the tree-building process to map out how similar different subsets of your input sequences are to each other. As Joe pointed out, the connectivity of the tree can be inferred mathematically without generating the theoretical sequences for each node, but I think it's still helpful to think of the nodes as theoretical sequences that are most similar to the nodes closest to them in the tree.

With that in mind, theoretical sequence A is more similar to NM162.fasta than it is to NM214.fasta, and it is more similar to these two input sequences than to any of your other input sequences (the ones named at the terminal nodes, aka "leaves", of the tree). Theoretical sequence A is about as similar to theoretical sequence B as it is to NM162.fasta (0.003). This might mean a difference of exactly the same number of bases/amino acids between each pair of sequences, or it could be that the comparison places greater importance (aka "weight") on some differences than on others: some algorithms treat a T<-->C or a G<-->A change as a smaller difference than a purine (A or G) swapping with a pyrimidine (T or C). Again, I've never used this particular software, so I'm speculating on the particulars, but hopefully this helps conceptually.
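As a small illustration of that weighting idea, the Python sketch below counts transitions (A<-->G, C<-->T) and transversions separately and feeds them into the Kimura two-parameter (K2P) distance. K2P is used here only as a well-known example of a model that treats the two kinds of change differently; it is an assumption for the example, not a statement about what Parsnp or FastTree does. The two sequences are invented.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_changes(seq1, seq2):
    """Count transitions and transversions between two aligned sequences.

    Columns containing gaps or ambiguous bases are skipped.
    Returns (transitions, transversions, compared_sites).
    """
    transitions = transversions = compared = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue
        compared += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    return transitions, transversions, compared


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance in substitutions per site."""
    ts, tv, n = count_changes(seq1, seq2)
    p, q = ts / n, tv / n
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# Invented sequences: one transition (A->G) and one transversion (C->A).
seq_a = "ACGTACGTACGTACGTACGT"
seq_b = "GCGTAAGTACGTACGTACGT"

ts, tv, n = count_changes(seq_a, seq_b)
print("transitions:", ts, "transversions:", tv, "sites compared:", n)
print("raw proportion of differences:", (ts + tv) / n)
print("K2P distance:", round(k2p_distance(seq_a, seq_b), 4))
```

The K2P value comes out slightly larger than the raw proportion of differing sites, because the model also allows for changes that happened at the same site more than once and are therefore hidden.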

[image: annotated screenshot of the phylogenetic tree]


It was very helpful guys, thank you very much!


AFAIK, some tree-building algorithms attempt to 'infer' what a likely ancestral sequence was. That is to say, the 'real' sequences you provided to the software form the leaves/tips of the tree, while each of the bifurcations along the tree, back to the root, represents a hypothetical ancestor sequence of the ones that descend from it.

Some tools may literally build this sequence (hence "computer-generated"), though I don't think it's strictly necessary that they do so, as they can operate on the likelihoods etc. instead.
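One simple way to picture a "hypothetical ancestor" is Fitch's parsimony procedure, sketched below in Python for a single alignment column. This is a deliberately simple stand-in and not the likelihood machinery that ML tools use; the tree shape and the bases at the tips are invented for the example.

```python
def fitch(node, tip_states):
    """Fitch's bottom-up pass for one alignment column.

    node is either a tip name (str) or a (left, right) tuple of subtrees.
    Returns (candidate ancestral bases, minimum number of changes).
    """
    if isinstance(node, str):
        return {tip_states[node]}, 0
    left_set, left_changes = fitch(node[0], tip_states)
    right_set, right_changes = fitch(node[1], tip_states)
    shared = left_set & right_set
    if shared:
        # Both children agree on at least one base: no extra change needed.
        return shared, left_changes + right_changes
    # Children disagree: keep both options and count one change.
    return left_set | right_set, left_changes + right_changes + 1


# Invented topology and bases for one SNP position.
tree = (("NM162", "NM214"), ("NM11-7", "NM126"))
column = {"NM162": "T", "NM214": "T", "NM11-7": "C", "NM126": "C"}

print("ancestor of NM162/NM214:", fitch(tree[0], column))
print("root:", fitch(tree, column))
```

The ancestor of NM162 and NM214 is inferred to carry a T, while the root is left ambiguous between C and T with one change somewhere below it. Likelihood methods do something comparable but weigh every possible base by its probability instead of keeping a plain set.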


Thank you very much it is very helpful.

As for the branch lengths, is there a way to display the percent similarity at the start of the branches? Something like this:

[image: example tree with percent similarity shown at the nodes]

I think that would be easier for me to understand.

Thank you!


You can edit node values in the tree file to be whatever you like, but there are some important nuances to appreciate here:

If all you want is a tree representation of a multiple alignment ("neighbour joining") then you can output this from tools like SeaView and Clustal if memory serves.

I don't know exactly what ParSNP does to build its trees, but don't fall into the trap of thinking that the branch lengths are a direct measure of sequence similarity (they do tell you which sequences are most like one another, but it's not usually a simple comparison of string similarity). If you have used a "proper" phylogeny-building approach such as a Maximum Likelihood tree, it has models of evolution baked into it which make predictions about which substitutions are more or less likely and how many substitutions might have happened.

It may be that ParSNP does do simpler comparisons and so this is valid, but I would check to be sure.
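To show why a model-based distance is not the same thing as "100 minus percent identity", here is a short Python sketch using the Jukes-Cantor (JC69) correction, the simplest substitution model. It is chosen purely for illustration and is an assumption, not what ParSNP necessarily uses; the proportions of differing sites are made-up inputs.

```python
import math


def jc69_distance(p):
    """JC69-corrected distance for a proportion p of differing sites.

    d = -3/4 * ln(1 - 4p/3), in expected substitutions per site.
    Only defined for p < 0.75.
    """
    if not 0 <= p < 0.75:
        raise ValueError("JC69 distance is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


# Made-up proportions of differing sites between two sequences.
for p in (0.003, 0.05, 0.30):
    identity = 100 * (1 - p)
    print(f"identity {identity:.1f}%  raw difference {p:.3f}  "
          f"JC69 distance {jc69_distance(p):.4f}")
```

For very similar sequences the corrected distance and the raw difference are almost the same, but they drift apart as sequences diverge, because the model accounts for repeated changes at the same site. That is one reason relabelling branches as "percent similarity" can mislead.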


but don't fall into the trap of thinking that the branch lengths are a direct measure of sequence similarity

Thank you for the clarification, because that is what I was thinking.

Thank you very much


Is there an algorithm/model you would recommend for epidemiological tracking of bacteria (for a public health institute)? From what I understand, ML is better, which is why I used Parsnp here. The attached tree was drawn with iTOL.

Alternatively, use RAxML:

raxml-ng --all --model GTR+G --msa $msa --threads ${cpus}

However, in my office they use NJ with the 'JC69' model, or more often 'Pearson', via BioNumerics, but nobody can tell me why. My task is to replace the BioNumerics tools they are used to working with by open-source code.

Is my approach better?

Thank you very much and have an excellent day everyone!


Please use Add Reply/Add Comment when responding to existing posts to keep the logical flow.

Add Answer should only be used for adding new answers for the original question.


oh very sorry


Any time I've had an interaction with "proper" phylogenetics people, they tend to use ML methods AFAIK.

The other tool that comes up a lot is BEAST which has quite sophisticated molecular clock models underlying it. That comes with some pretty steep sample/dataset requirements for it to be accurate though so your mileage may vary.


If you ever run into this problem, which I call 'because that's how I was taught', be very very suspicious of everything else including the solution given.

NJ + JC69 might be the best possible solution, but I would hazard a guess that it was done this way 20 years ago because of computational limits AND the software available at the time. We generally don't have this problem, and haven't for the last... 10 years? Longer?


Also note that people are used to finding node support values where you propose to edit nodes. This will confuse readers and therefore I wouldn't label the nodes differently. If you want to plot pairwise differences, you may add a heatmap matrix plot.
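If a pairwise view is what you are after, the matrix behind such a heatmap is easy to compute yourself. The Python sketch below counts differing positions between every pair of aligned sequences and prints a tab-separated table; the sequences are invented, and how the table is then loaded into a plotting tool is left open because that depends on the tool.

```python
from itertools import combinations


def snp_differences(seq1, seq2):
    """Number of aligned positions where both bases are known and differ."""
    return sum(
        1
        for x, y in zip(seq1.upper(), seq2.upper())
        if x in "ACGT" and y in "ACGT" and x != y
    )


def pairwise_matrix(alignment):
    """Return (names, matrix) where matrix[a][b] is the SNP difference."""
    names = list(alignment)
    matrix = {a: {b: 0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        diff = snp_differences(alignment[a], alignment[b])
        matrix[a][b] = matrix[b][a] = diff
    return names, matrix


# Invented aligned sequences; names borrowed from the thread for readability.
alignment = {
    "NM11-7": "ACGTACGTAC",
    "NM126": "ACGTACGTAC",
    "NM162": "ACGTTCGTAC",
    "NM214": "ACGTTCGAAT",
}

names, matrix = pairwise_matrix(alignment)
print("\t".join([""] + names))
for a in names:
    print("\t".join([a] + [str(matrix[a][b]) for b in names]))
```

Raw SNP counts like these are often easier to communicate in an outbreak setting than branch lengths, and keeping them in a separate matrix leaves the node labels free for support values.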
