Postgresql: Select Only Taxa Which Are Children Of A Given Taxon Using Parent Taxon Id
3
0
Entering edit mode
12.6 years ago
Tgh ▴ 10

Hi everyone,

I am stuck with the following problem: From a 'BLAST result' table in a database I want to select only those entries which fulfill some specific requirements (e.g. contain a keyword) and are from species that are 'children' of a given taxon 'XY' with the id 123.
E.g., I want all entries which belong to species which themselves belong to the superkingdom Bacteria. The query that takes into account the specific keyword, etc. works fine.
However, I find it very hard to get a good idea how to write the query so that it only takes the 'children' of a taxon.

The table 'taxonomy' was created from the NCBI taxonomy names.dmp and nodes.dmp files. The taxon_id is the NCBI taxonomy ID, while the other IDs are from the database only. It looks like this:

   id    | taxon_id | parent_taxon_id | taxon_name |     rank     
---------+----------+-----------------+------------+--------------
 3963792 |        2 |          131567 | Bacteria   | superkingdom

The 'BLAST result' table contains the foreign key 'taxonomy_id' of the taxonomy table. It looks like this (I removed some columns with more information on the actual BLAST hit):

   id   | score | tax_id |     blast_spec     | taxonomy_id
--------+-------+--------+--------------------+-------------
 850859 |   722 |   9483 | Callithrix jacchus |     3981316
 850860 |   726 |  10141 | Cavia porcellus    |     3982450

My idea would be, to make a FOR loop and iterate over all results that I get and check, whether the 'taxonomy_id' in there has a parent_taxon_id that has the 123 as its parent_taxon_id.

However, I have absolutely no clue how to design such a loop in SQL and how to make sure it does not run forever. I have not yet understood the concept of joins, but these probably would also help in this particular problem?

This is probably a problem that a person with more experience in SQL than me can easily handle..?

I am very thankful for all your advice!

ncbi taxonomy • 5.5k views
ADD COMMENT
0
Entering edit mode

Can you add the table 'BLAST result', and what the foreign keys are?

ADD REPLY
0
Entering edit mode

For 'BLAST result' it says: Foreign-key constraints: "BLASTresult_taxonomy_id_fkey" FOREIGN KEY (taxonomy_id) REFERENCES taxonomy(id)

ADD REPLY
0
Entering edit mode

Why does the blast result table have both a tax_id and a taxonomy_id column?

Ah, taxonomy_id refers to taxonomy(id) while tax_id refers to taxonomy(taxon_id). Having redundancy like this is not the best idea.

ADD REPLY
1
Entering edit mode
12.6 years ago
asjo ▴ 120

One solution is to use the connectby() function from the tablefunc contrib extension. In PostgreSQL 9.1+ you install it simply by going CREATE EXTENSION tablefunc;

With a little example taxonomy tree looking like this:

taxonomy=# SELECT * FROM taxonomy ORDER BY id;
 id | taxon_id | parent_taxon_id |        taxon_name        |  rank   
----+----------+-----------------+--------------------------+---------
  1 |        1 |               1 | root                     | no rank
  2 |        2 |               1 | Bacteria                 | no rank
  3 |        3 |               1 | Eukaryota                | no rank
  4 |        4 |               1 | Archaea                  | no rank
  5 |        5 |              11 | Callithrix jacchus       | species
  6 |        6 |              11 | Cavia porcellus          | species
  7 |        7 |              10 | Bacillus subtilis        | species
  8 |        8 |               9 | Saccharomyces cerevisiae | species
  9 |        9 |               3 | Fungi                    | kingdom
 10 |       10 |               2 | Bacillus                 | genus
 11 |       11 |               3 | Mammalia                 | class
 12 |       12 |               4 | Pyrobaculum              | genus
 13 |       13 |              12 | Pyrobaculum neutrophilum | species
(13 rows)

You can use connectby() like this, to get all the nodes under, say, Bacteria:

taxonomy=# SELECT * FROM connectby('taxonomy', 'taxon_id', 'parent_taxon_id', 'taxon_id', '2', 0, '~') AS t(keyid text, parent_keyid text, level int, branch text, pos int);
 keyid | parent_keyid | level | branch | pos 
-------+--------------+-------+--------+-----
 2     |              |     0 | 2      |   1
 10    | 2            |     1 | 2~10   |   2
 7     | 10           |     2 | 2~10~7 |   3
(3 rows)

Ok, we are ready to combine this with your results - just JOIN a connectby() query returning all records from the taxonomy table located under Bacteria:

Now, with a sample result table looking like this:

taxonomy=# SELECT * FROM result;
 id | score | tax_id |        blast_spec        | taxonomy_id 
----+-------+--------+--------------------------+-------------
  1 |   722 |      5 | Callithrix jacchus       |           5
  2 |   726 |      6 | Cavia porcellus          |           6
  3 |   756 |      7 | Bacillus subtilis        |           7
  4 |   781 |      8 | Saccharomyces cerevisiae |           8
(4 rows)

We can get just the results under Bacteria like this:

taxonomy=# SELECT * from result JOIN (SELECT keyid::int FROM (SELECT * FROM connectby('taxonomy', 'taxon_id', 'parent_taxon_id', 'taxon_id', '2', 0, '~') AS t(keyid text, parent_keyid text, level int, branch text, pos int)) AS bacteria) AS b ON taxonomy_id=keyid ORDER BY score;
 id | score | tax_id |    blast_spec     | taxonomy_id | keyid 
----+-------+--------+-------------------+-------------+-------
  3 |   756 |      7 | Bacillus subtilis |           7 |     7
(1 row)

If we want all the Eukaryote results, we join the connectby() starting at taxon_id 3 instead:

taxonomy=# SELECT * from result JOIN (SELECT keyid::int FROM (SELECT * FROM connectby('taxonomy', 'taxon_id', 'parent_taxon_id', 'taxon_id', '3', 0, '~') AS t(keyid text, parent_keyid text, level int, branch text, pos int)) AS eukaryota) AS e ON taxonomy_id=keyid ORDER BY score;
 id | score | tax_id |        blast_spec        | taxonomy_id | keyid 
----+-------+--------+--------------------------+-------------+-------
  1 |   722 |      5 | Callithrix jacchus       |           5 |     5
  2 |   726 |      6 | Cavia porcellus          |           6 |     6
  4 |   781 |      8 | Saccharomyces cerevisiae |           8 |     8
(3 rows)

And finally if we just want the mammals, we can start from taxon_id 11 in the tree:

taxonomy=# SELECT * from result JOIN (SELECT keyid::int FROM (SELECT * FROM connectby('taxonomy', 'taxon_id', 'parent_taxon_id', 'taxon_id', '11', 0, '~') AS t(keyid text, parent_keyid text, level int, branch text, pos int)) AS mammals) AS m ON taxonomy_id=keyid ORDER BY score;
 id | score | tax_id |     blast_spec     | taxonomy_id | keyid 
----+-------+--------+--------------------+-------------+-------
  1 |   722 |      5 | Callithrix jacchus |           5 |     5
  2 |   726 |      6 | Cavia porcellus    |           6 |     6
(2 rows)

You can also use common table expressions to traverse the tree - I have put that in an alternative answer as Biostars only allows 5000 characters in an answer :-)

ADD COMMENT
1
Entering edit mode
12.6 years ago
asjo ▴ 120

Using the same example data as in my answer using connectby(), here is a solution using commont table expressions instead.

All the results under Bacteria:

taxonomy=# SELECT * FROM result JOIN (WITH RECURSIVE relevant_taxonomy AS (SELECT taxon_id FROM taxonomy WHERE taxon_id=2 UNION ALL SELECT taxonomy.taxon_id FROM taxonomy JOIN relevant_taxonomy ON relevant_taxonomy.taxon_id=taxonomy.parent_taxon_id) SELECT taxon_id FROM relevant_taxonomy) AS subtree ON subtree.taxon_id=result.tax_id;
 id | score | tax_id |    blast_spec     | taxonomy_id | taxon_id 
----+-------+--------+-------------------+-------------+----------
  3 |   756 |      7 | Bacillus subtilis |           7 |        7
(1 row)

All the Eukaryote results:

taxonomy=# SELECT * FROM result JOIN (WITH RECURSIVE relevant_taxonomy AS (SELECT taxon_id FROM taxonomy WHERE taxon_id=3 UNION ALL SELECT taxonomy.taxon_id FROM taxonomy JOIN relevant_taxonomy ON relevant_taxonomy.taxon_id=taxonomy.parent_taxon_id) SELECT taxon_id FROM relevant_taxonomy) AS subtree ON subtree.taxon_id=result.tax_id;
 id | score | tax_id |        blast_spec        | taxonomy_id | taxon_id 
----+-------+--------+--------------------------+-------------+----------
  1 |   722 |      5 | Callithrix jacchus       |           5 |        5
  2 |   726 |      6 | Cavia porcellus          |           6 |        6
  4 |   781 |      8 | Saccharomyces cerevisiae |           8 |        8
(3 rows)

And finally just the mammal results:

taxonomy=# SELECT * FROM result JOIN (WITH RECURSIVE relevant_taxonomy AS (SELECT taxon_id FROM taxonomy WHERE taxon_id=11 UNION ALL SELECT taxonomy.taxon_id FROM taxonomy JOIN relevant_taxonomy ON relevant_taxonomy.taxon_id=taxonomy.parent_taxon_id) SELECT taxon_id FROM relevant_taxonomy) AS subtree ON subtree.taxon_id=result.tax_id;
 id | score | tax_id |     blast_spec     | taxonomy_id | taxon_id 
----+-------+--------+--------------------+-------------+----------
  1 |   722 |      5 | Callithrix jacchus |           5 |        5
  2 |   726 |      6 | Cavia porcellus    |           6 |        6
(2 rows)

I think the CTE - this - solution is more elegant. I would expect it to be more efficient as well, but my test-set is way to small to say.

ADD COMMENT
0
Entering edit mode

Thank you so much for your detailed and also easy to understand answer!! I am going to give it a try on Monday and will get back here to tell if it worked.

ADD REPLY

Login before adding your answer.

Traffic: 2488 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6