Suitable Database Management System For Storing/Querying Phylogenetic Trees
4
8
Entering edit mode
11.9 years ago
Fabsta ▴ 120

Hi guys,

I am interested in alternative ways of storing phylogenetic trees in a database.

Currently, we represent trees as nested sets in a MySQL database. The trees are gene trees of animal species (~16,000) and give homology(orthology/parology) information for each gene. Since homology is a pairwise relationship, we currently have 100 million pairwise homologies stored in our db.

This makes tables big and queries slow.

Since trees are graphs and there seem to be some graph-based database management systems available, I was wondering if anyone has used/experience with storing/querying phylogenetic trees in graph-based databases?

Any tipps/links are helpful!

Thanks a lot and enjoy your weekend

database graph tree • 4.1k views
ADD COMMENT
2
Entering edit mode

Roderic Page ( @rdmpage) should have some nice ideas about this.

ADD REPLY
5
Entering edit mode
11.9 years ago
Roderic Page ▴ 390

Do I understand you correctly? For each pair of genes in a tree you store the relationship between that pair, so that for a tree with 100 sequences you store 100x100 = 10,000 pairs of values. This seems terribly wasteful. Why not traverse the tree, colouring each node by orthology cluster. For example, given the tree for six genes ((a1,(a2,a3)),(b1,(b2,b3)) for species (1,(2,3)) the gene tree would be coloured ((a,(a,a)),(b,(b,b)). Then store the gene colours as an additional column in the database table that lists the genes:

gene_id    | tree_id  | colour
  a1       |     1    |    a
  a2       |     1    |    a
  a3       |     1    |    a
  b1       |     1    |    b
  b2       |     1    |    b
  b3       |     1    |    b

The orthology of two genes is a simple lookup of each gene and comparison of their "colour". I think the trick to storing and querying trees in MySQL is to preprocess them externally to generate something MySQL can index.

ADD COMMENT
1
Entering edit mode

But how would you color this gene tree: (ab1,((a2,a3),(b2,b3)))? Tag gene ab1 with two colors? Tag the entire tree with one color? Or tag three major clades with 3 colors?

ADD REPLY
1
Entering edit mode

Of the top of my head I would assign different colour to ab, a, and b. This raises the issue that a <> b is not the same relationship as a <> ab. You could encode this in the colouring, e.g. ab=11, a=10, b=01 and if the logical AND of the colours is 0 then the genes are paralogs. To generalise this, if the colouring encodes the subtree of duplications then the test for paralogy is whether the colours are different and do not encode nodes on an ancestor-descendant path on the tree.

ADD REPLY
0
Entering edit mode

I am still not sure how it works in general. Say the species tree is ((1,2),(3,4)) and we have a gene tree (((a1,a2),(b1,b2)),((c3,c4),(d3,d4))). How would you color? Even if it works, given a huge tree with hundreds of duplication events, hundreds of atomic colors may be necessary. You need big integer support, or to color leaves with a set of integers. Keeping compound colors still requires much more storage than keeping a tree in the NH format. Furthermore, your scheme only works for orthologs. It discards the date of a pair of paralogs or at least complicates the retrieval of the date which we mostly care about. For example, the human CCNE genes (Cyclin E) fit such a gene tree: (CCNE-ciona,((CCNE1-human,CCNE1-fish),(CCNE2-human,CCNE2-fish))). You need to check all the colors to find out the date of the duplication. Doing this is more complicated than parsing the reconciled tree. Retrieving multiple records may also be slower.

Given all these complications, I think it is more convenient and efficient to simply keep the reconciled tree in the database and write a program to retrieve orthologs/paralogs upon query. Say the query is "find all the orthologs and paralogs of gene A". We do the following. 1) retrieve the tree containing A; 2) parse the tree or subtree; 3) check each internal node on the path between A and root: if the node is speciation, the sister clade is ortholog; if the node is duplication, the sister clade is paralog and we know the date. This procedure is simple and straightforward. As we need to keep trees anyway, it does not require additional storage. The speed bottleneck is tree parsing, but a fast parser in C will still be better than multiple SQL queries.

I thought the above when I built treefam. However, we did not use this approach in the end because it cannot keep bootstrapping values for each pair.

ADD REPLY
5
Entering edit mode
11.9 years ago

Check out Neo4j: http://www.neo4j.org/ and bio4j: http://bio4j.com/

I've personally never used it. But they seem to be getting some good press.

ADD COMMENT
4
Entering edit mode
11.9 years ago

I guess you did this already - or are (entirely appropriately!) are checking first here for a quick answer that might be useful before trawling through the info on how others have done it yourself :)

However, if BioStar or something similar wasn't available to address this question, then I'd begin by checking out how other similar resources deal with this kind of problem e.g.:

Thought I'd flag these up in case they might be useful/interesting for others.

ADD COMMENT
2
Entering edit mode
11.9 years ago
lh3 33k

There is a way to describe a tree topology in a tabular format. In principle, you can keep the tabularized trees in MySQL and perform multiple queries to traverse the tree to get your answer.

Nonetheless, if all you want to do is to get homology from a gene tree, I would recommend to keep all trees in the Newick format in a table and keep the gene-tree relationship in another table. Upon a query, you pull the tree from the tree table and run a program to get the orthology/paralogy information. Inferring this information can be done in linear time. It will be much faster than lots of MySQL queries.

Things are more complicated in treefam, though, where we infer orthology confidence by bootstrapping alignments and rebuilding trees for 100 rounds. Certainly we cannot afford 100 rounds of tree building for each query. We still keep gene pairs, but to avoid a huge table, we only keep orthologous and within-species paralogous pairs. In my view, these are the most frequent queries. Any other homology are cross-species paralogy. We can just pull the tree to get them, without confidence value, though.

In treefam, we keep orthogous and within-species paralogous pairs in the following table:

CREATE TABLE `ortholog` (
  `idx1` int NOT NULL default '0',
  `idx2` int NOT NULL default '0',
  `taxon_id` int unsigned NOT NULL default '0',
  `bootstrap` smallint unsigned NOT NULL default '0',
  PRIMARY KEY  (`idx1`,`idx2`),
  KEY `idx2` (`idx2`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COMMENT='ortholog/paralog'

where idx1 and idx2 are integer gene IDs, pointing to the gene table, and taxon_id indicates the (ancestral) species of the last common ancestor (LCA) of idx1 and idx2, inferred from the gene tree containing idx1 and idx2. For example, for a gene tree (d-gorilla,((a-human,a-chimp),(b-human,c-human))), we keep the following pairs: pairs between d-gorilla and the other 4 genes, (a-human,a-chimp) and (b-human,c-human). This table has 7 million records for ~15-20 species in treefam. In principle, this table also works with cross-species paralogs, but you will get a huge table, which I think is not worthwhile.

ADD COMMENT

Login before adding your answer.

Traffic: 2053 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6