There is a way to describe a tree topology in a tabular format. In principle, you can keep the tabularized trees in MySQL and perform multiple queries to traverse the tree to get your answer.
Nonetheless, if all you want to do is to get homology from a gene tree, I would recommend to keep all trees in the Newick format in a table and keep the gene-tree relationship in another table. Upon a query, you pull the tree from the tree table and run a program to get the orthology/paralogy information. Inferring this information can be done in linear time. It will be much faster than lots of MySQL queries.
Things are more complicated in treefam, though, where we infer orthology confidence by bootstrapping alignments and rebuilding trees for 100 rounds. Certainly we cannot afford 100 rounds of tree building for each query. We still keep gene pairs, but to avoid a huge table, we only keep orthologous and within-species paralogous pairs. In my view, these are the most frequent queries. Any other homology are cross-species paralogy. We can just pull the tree to get them, without confidence value, though.
In treefam, we keep orthogous and within-species paralogous pairs in the following table:
CREATE TABLE `ortholog` (
`idx1` int NOT NULL default '0',
`idx2` int NOT NULL default '0',
`taxon_id` int unsigned NOT NULL default '0',
`bootstrap` smallint unsigned NOT NULL default '0',
PRIMARY KEY (`idx1`,`idx2`),
KEY `idx2` (`idx2`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COMMENT='ortholog/paralog'
where idx1
and idx2
are integer gene IDs, pointing to the gene table, and taxon_id
indicates the (ancestral) species of the last common ancestor (LCA) of idx1 and idx2, inferred from the gene tree containing idx1 and idx2. For example, for a gene tree (d-gorilla,((a-human,a-chimp),(b-human,c-human)))
, we keep the following pairs: pairs between d-gorilla and the other 4 genes, (a-human,a-chimp) and (b-human,c-human). This table has 7 million records for ~15-20 species in treefam. In principle, this table also works with cross-species paralogs, but you will get a huge table, which I think is not worthwhile.
Roderic Page ( @rdmpage) should have some nice ideas about this.