Question

Ortholog clusters in Bacteria


3.2 years ago

Andrzej Zielezinski 11k

Hi,

Are there any databases/initiatives that compute ortholog clusters in Bacteria? The NCBI COG database covers only 1,300 bacterial species. I am looking for something more comprehensive, ideally to cover all 45,000 bacterial species from Genome Taxonomy Database (GTDB).

Thanks!

orthologs gene family bacteria GtDB • 1.0k views

updated 3.2 years ago by Mensur Dlakic ★ 29k • written 3.2 years ago by Andrzej Zielezinski 11k


I think eggNOG is the largest database, with precomputed ortholog groups from 4,400 representative bacterial species.

3.2 years ago by andres.firrincieli 3.9k

score 2 · Accepted Answer · 2022-02-23

The COG database was last updated in 2020, which you probably know. For the sake of other readers, here is the reference:

https://academic.oup.com/nar/article/49/D1/D274/5964069

As they indicate in the section titled "Expanded genome coverage", it was too computationally intensive to analyze all available genomes and MAGs; the majority of both bacteria and archaea in GTDB are MAGs. Even the complete genomes alone were too demanding, so they settled on a smaller group. Given the number of COGs (4,877), I think that group is representative of both domains, even though it is based on a relatively small fraction of the total.
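To give a feel for why the cost grows so quickly, here is a rough back-of-the-envelope sketch in Python. It counts the protein pairs a naive all-vs-all comparison would have to consider, using the species counts quoted in the question. The figure of 3,000 proteins per genome is my own illustrative assumption, not a number from the COG paper or GTDB, and real tools use heuristics to avoid comparing every pair, so treat this as an upper bound on a naive approach rather than a runtime estimate.

```python
# Back-of-the-envelope scale of a naive all-vs-all protein comparison.
# ASSUMPTION: ~3,000 proteins per genome is an illustrative average only;
# it does not come from the COG paper or from GTDB.
PROTEINS_PER_GENOME = 3_000


def all_vs_all_pairs(n_genomes: int, proteins_per_genome: int = PROTEINS_PER_GENOME) -> int:
    """Number of unordered protein pairs across all genomes."""
    n = n_genomes * proteins_per_genome
    return n * (n - 1) // 2


cog_like = all_vs_all_pairs(1_300)    # species count quoted for COG in the question
gtdb_like = all_vs_all_pairs(45_000)  # species count quoted for GTDB in the question

print(f"1,300 genomes:  {cog_like:.2e} pairs")
print(f"45,000 genomes: {gtdb_like:.2e} pairs")
print(f"ratio: {gtdb_like / cog_like:.0f}x")
```

Because the pair count grows with the square of the number of proteins, going from 1,300 to 45,000 genomes multiplies the naive workload by roughly a thousand, not by 35.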

There is some evidence of novel and unusual protein families in more recently discovered bacteria:

Still, it is a safe bet that many of those are just divergent variants of known proteins, as it is unlikely that these groups evolved thousands of protein families different from all other bacteria.

The first reference above has more ideas about how to run this kind of analysis yourself, though I don't recommend it. I have done it for about 5,000 MAGs: it is very difficult to set up and demands a great deal of time, memory, and other resources. If you are not easily dissuaded, this may help:

https://github.com/raphael-upmc/proteinClusteringPipeline
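For readers who want to see the core grouping idea in miniature, below is a toy Python sketch. It is not the method used by the linked pipeline, COG, or eggNOG; it only shows how pairwise similarity hits can be merged into families with a union-find structure (single-linkage grouping). The protein IDs and hits are made up for illustration, and in practice the hits would come from a dedicated sequence search tool. Note that this groups homologs in general and does nothing to separate orthologs from paralogs.

```python
from collections import defaultdict


def cluster_single_linkage(proteins, hits):
    """Group proteins so that any two joined by a chain of hits share a family."""
    parent = {p: p for p in proteins}

    def find(p):
        # Walk up to the root, shortening the path along the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two families

    families = defaultdict(set)
    for p in proteins:
        families[find(p)].add(p)
    return sorted(sorted(members) for members in families.values())


# Hypothetical protein IDs and similarity hits, invented for this example.
proteins = ["gA_1", "gA_2", "gB_1", "gB_2", "gC_1"]
hits = [("gA_1", "gB_1"), ("gB_1", "gC_1"), ("gA_2", "gB_2")]

families = cluster_single_linkage(proteins, hits)
print(families)
```

Single-linkage grouping is the simplest possible choice and tends to chain unrelated families together through multi-domain proteins, which is one reason real pipelines are so much harder to set up.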