Get a reference phylogenetic tree of known taxa from GTDB
12 months ago
Gio • 0

Hello,

I have a set of genomes I downloaded from NCBI. I would like to make a reference phylogenetic tree where only they appear.

Instead of aligning them or using mash distance to make my own tree, is there a way I can simply provide the genomes or taxa to GTDB and get a tree back from it?

phylogenetic-tree bacteria gtdb • 1.7k views
12 months ago
Mensur Dlakic ★ 28k

To the best of my knowledge, there is no way to do this using the GTDB website. However, it can be done locally with the GTDB Toolkit (GTDB-Tk). You select a group of genome sequences and run them through the program. It performs gene prediction, single-copy marker identification, and taxonomic assignment, which includes a global GTDB tree with your organisms placed in it (too big for most applications) plus a tree with just your organisms.

https://github.com/Ecogenomics/GTDBTk

11 months ago
fredjaya ▴ 20

If I understand correctly, given a list of the NCBI genome IDs, you can prune a .tree file (e.g. the GTDB tree) with Biopython. Note that the NCBI IDs must match the tip labels in the tree exactly.

Example script here.
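To make the label-matching caveat concrete, here is a minimal sketch of the pruning step. It assumes the GTDB tree labels its tips with the NCBI assembly accession prefixed by `RS_` (RefSeq, `GCF_`) or `GB_` (GenBank, `GCA_`); check a few tip names in your own tree before relying on that. The function names and file paths are hypothetical, and pruning tip by tip with `Bio.Phylo` can be slow on a tree as large as the full GTDB reference.

```python
def to_gtdb_label(ncbi_accession):
    """Map an NCBI assembly accession to the prefixed tip label assumed here."""
    acc = ncbi_accession.strip()
    if acc.startswith("GCF_"):
        return "RS_" + acc  # RefSeq assemblies (assumed prefix)
    if acc.startswith("GCA_"):
        return "GB_" + acc  # GenBank assemblies (assumed prefix)
    return acc  # already prefixed, or not an NCBI accession: leave as-is


def prune_to_labels(tree_path, keep_labels, out_path):
    """Write a copy of the tree containing only the requested tips.

    Returns the requested labels that were not found in the tree.
    """
    # Imported here so to_gtdb_label() can be used without Biopython installed.
    from Bio import Phylo

    keep = set(keep_labels)
    tree = Phylo.read(tree_path, "newick")
    tip_names = {tip.name for tip in tree.get_terminals()}
    if not keep & tip_names:
        raise ValueError("none of the requested labels are tips in this tree")
    # get_terminals() returns a list, so pruning while looping is safe.
    for tip in tree.get_terminals():
        if tip.name not in keep:
            tree.prune(tip)
    Phylo.write(tree, out_path, "newick")
    return sorted(keep - tip_names)
```

A call might look like `prune_to_labels("gtdb.tree", [to_gtdb_label(a) for a in my_accessions], "pruned_gtdb.tree")`, where the file names are placeholders. Any labels returned as missing are genomes that are not tips in the tree and would need to be handled separately.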

Nicely done. However, pruning alone will not suffice for user-specific entries that are not already in the tree.

Good point. I think it's possible to "place" the new tips via IQ-TREE 2's constrained tree search option.

iqtree -s user_sequences.fa -g pruned_gtdb.tree

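Here the pruned tree acts as a topological constraint rather than just a starting tree: IQ-TREE keeps its relationships fixed and infers where the "new" user-specific sequences fit. The file passed with -s must be a multiple sequence alignment that contains the taxa in the constraint tree as well as the new sequences.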

I think it is highly likely that most (or all) user-specified NCBI genomes can be found in GTDB. The reference GTDB tree encompasses approximately 80,000 bacterial species, each represented by a single genome (representative genome for species). However, the complete GTDB database comprises around 400,000 genomes, including both representative and non-representative genomes. To access the full genome list, including NCBI identifiers, the user can download the metadata file from GTDB at https://data.gtdb.ecogenomic.org/releases/latest/. This file provides details for all 400,000 genomes present in the GTDB database.
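As a rough sketch of how that metadata file could be used: since only representative genomes are tips in the reference tree, each user genome can be looked up and mapped to its species representative. This assumes the decompressed TSV has the columns `accession`, `ncbi_genbank_assembly_accession` and `gtdb_genome_representative`, and that GTDB accessions carry an `RS_`/`GB_` prefix; verify both against the header of the release you download. The function names and the toy rows (with made-up accessions) are illustrative only.

```python
import csv
import io


def strip_gtdb_prefix(gtdb_accession):
    """Drop the RS_/GB_ prefix assumed on GTDB accessions."""
    if gtdb_accession.startswith(("RS_", "GB_")):
        return gtdb_accession[3:]
    return gtdb_accession


def map_to_representatives(metadata_handle, ncbi_accessions):
    """Return ({user accession: representative accession}, [accessions not found])."""
    wanted = {acc.strip() for acc in ncbi_accessions}
    found = {}
    for row in csv.DictReader(metadata_handle, delimiter="\t"):
        # A genome may be listed under its RefSeq or its GenBank accession.
        candidates = {
            strip_gtdb_prefix(row["accession"]),
            row.get("ncbi_genbank_assembly_accession", ""),
        }
        for acc in candidates & wanted:
            found[acc] = row["gtdb_genome_representative"]
    missing = sorted(wanted - set(found))
    return found, missing


# Toy rows with made-up accessions, only to show the assumed layout.
TOY_METADATA = (
    "accession\tncbi_genbank_assembly_accession\tgtdb_genome_representative\n"
    "RS_GCF_000000001.1\tGCA_000000001.1\tRS_GCF_000000002.1\n"
    "RS_GCF_000000002.1\tGCA_000000002.1\tRS_GCF_000000002.1\n"
)

found, missing = map_to_representatives(
    io.StringIO(TOY_METADATA),
    ["GCF_000000001.1", "GCA_000000002.1", "GCF_000000009.1"],
)
print(found)    # user accession -> representative that appears in the tree
print(missing)  # accessions absent from the metadata
```

With the real file, the handle would come from opening the decompressed TSV (or `gzip.open(path, "rt")` if it is gzipped). The representative accessions are the labels to keep when pruning the reference tree, and anything reported as missing is not in GTDB and would have to be placed by other means.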
