I have calculated the pangenome for a set of bacterial genomes I am working with. I have a classical pangenome matrix that looks like this:
       speciesA  speciesB  ...  speciesZ
gene1  1         2              0
gene2  0         1              0
gene3  1         1              1
gene4  1         1              2
Now I would like to map gene gain and loss onto a phylogenetic tree. Something like this:
I guess I could just record all families of genes shared among the different species in each branch of the tree, but I believe there is a more elegant way of doing that.
Any suggestion?
Hi, thanks for the answer. Count looks really cool. However, as I understand it, it calculates rates of gain, loss, and duplication. I would be interested in displaying the absolute number of gene families gained or lost. Any other suggestions?
Count can give the actual ancestral family sizes, not only the rates. I am not sure I understand what you mean by "absolute number of gene families gained or los[t]". Assuming you want the number of events per branch: 1) use Count to get the family sizes at each node for every gene family, then label the appropriate branches (or nodes (*)) as a "gain" (family size goes from zero to non-zero) or a "loss" (from non-zero to zero). For a given gene, each branch is therefore either a "gain", a "loss", or nothing. 2) After doing this for all genes on a common tree, go to each branch of that tree and sum the "gains" over genes to get the number of gains per branch. (The same applies to losses.)
You can also use the other methods suggested by Federico, which may also be available in Count.
(*) For rooted trees there is a one-to-one correspondence between branches and nodes, but we usually interpret the events as happening "somewhere" in the branch.
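If it helps, the two steps above can be sketched in a few lines of Python. This is only a rough illustration, not something Count provides: the function name `count_gains_losses`, the `parent` dictionary (child node -> parent node) and the `sizes` dictionary (family -> node -> family size) are hypothetical structures made up here, and the toy numbers are invented. It assumes the ancestral family sizes have already been reconstructed as integers for every node of a common rooted tree; reading them from the output of your tool is left out.

```python
from collections import Counter

def count_gains_losses(parent, sizes):
    """Count gain and loss events per branch.

    parent: dict mapping each non-root node to its parent node
            (a branch is identified by its child node).
    sizes:  dict mapping family -> {node: family size}, covering
            tips and ancestral nodes (assumed already reconstructed).
    """
    gains, losses = Counter(), Counter()
    for family, size_at in sizes.items():
        for child, par in parent.items():
            before, after = size_at[par], size_at[child]
            if before == 0 and after > 0:
                gains[child] += 1      # family appears on this branch
            elif before > 0 and after == 0:
                losses[child] += 1     # family disappears on this branch
    return gains, losses

# Toy rooted tree (invented for illustration): root -> (N1, C), N1 -> (A, B)
parent = {"N1": "root", "C": "root", "A": "N1", "B": "N1"}

# Invented family sizes at every node, including ancestral ones
sizes = {
    "Family_1": {"root": 0, "N1": 1, "A": 1, "B": 2, "C": 0},
    "Family_2": {"root": 1, "N1": 1, "A": 0, "B": 1, "C": 1},
    "Family_3": {"root": 1, "N1": 1, "A": 1, "B": 1, "C": 2},
}

gains, losses = count_gains_losses(parent, sizes)
for node in parent:
    print(f"branch leading to {node}: +{gains[node]} / -{losses[node]}")
```

Changes in copy number that do not cross zero (for example 1 to 2) are deliberately ignored, since only presence/absence of the family counts as a gain or loss here. The per-branch totals can then be written onto the tree as branch labels in whatever tree viewer you use.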
Thanks very much again for your reply. So basically what you suggest is to go "manually" from one branch (node) to the next and check whether each gene family goes from 0 to > 0 or from > 0 to 0, is that right? For example, from node 26 to 25 Family_1 goes from 0 to 1, so it is a "gain". I was wondering whether there is a tool that does that.
Count is very cool. I have recently been working on gene family evolution, and when using Count I don't know how to choose the best model (gain and loss, BDI, or one of the other options in the Optimize Rate panel). Could you give me some suggestions?