I would like to check whether all the proteins in specific pathways (e.g. secretion systems, two-component regulatory systems) are present in a genome.
Is there a way to do this automatically, iterating the search over many genomes?
Cleaned up the post and moved to a Gist by Ram on 22-Apr-2022.
A combination of KEGG's API, blastp and scripting should do it. If you're working at a smaller scale, it might be feasible to just use BlastKOALA (http://www.kegg.jp/blastkoala/) and then the mapping tools provided on the site.
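For the scripting part, here is a minimal Python sketch of the "is everything present?" check across many genomes. It assumes you already have one KO annotation per gene for each genome as two tab-separated columns (gene ID, K number, with the K number missing for unannotated genes), and a list of K numbers you consider required for the pathway. The function names and the K numbers in the usage comment are placeholders, not real pathway definitions. It also treats every K number as mandatory, which is a simplification, since KEGG modules can allow alternative KOs for the same step.

```python
def read_ko_assignments(lines):
    """Collect K numbers from lines of the form: gene_id<TAB>K_number.

    Genes without an assignment (no second column, or an empty one) are skipped.
    """
    kos = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1].strip():
            kos.add(fields[1].strip())
    return kos


def missing_kos(required, genome_kos):
    """Return the required K numbers that are absent from a genome, sorted."""
    return sorted(set(required) - set(genome_kos))


def check_genomes(required, genomes):
    """genomes maps a genome name to an iterable of annotation lines.

    Returns {genome name: [missing K numbers]}; an empty list means complete.
    """
    return {
        name: missing_kos(required, read_ko_assignments(lines))
        for name, lines in genomes.items()
    }


# Usage idea (file names and K numbers are hypothetical):
# required = ["K00001", "K00002"]
# genomes = {path: open(path) for path in ["genomeA_ko.txt", "genomeB_ko.txt"]}
# print(check_genomes(required, genomes))
```

Any genome whose list comes back empty has every required K number; the others tell you which ones to look at by hand.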
Thanks for the answer. I tried that as well, but I was wondering how sensitive it is. In BlastKOALA the search uses the information for a genus as its reference, and the information available for the genus I am working with is quite limited. So will I miss proteins that are known to be involved in a specific process in other taxa but have not been described in my genus?
I don't think so:
The database files are generated from KEGG GENES as a collection of representative genomes by removing similar organisms at the species, genus or family level. When multiple members are present in each species/genus/family group, the first genome is taken as a representative genome. When the other members in the group contain different K numbers that are not present in the representative genome, those genes are added as if they are present in additional chromosomes or plasmids.
That is stated right there on the BlastKOALA site.
I'd take a list of proteins as queries, BLASTP them against an organism-filtered NR, and check the identity and similarity of the top 3 hits per query against a threshold.
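A rough Python sketch of that filtering step, assuming the BLASTP results were saved in the default 12-column tabular format (`-outfmt 6`). The cutoffs (40% identity, e-value 1e-5) are arbitrary placeholders, and the function names are mine. Note that the default tabular columns include percent identity but not percent similarity; you would need to request that column separately and adjust the parsing.

```python
from collections import defaultdict


def top_hits(blast_lines, n=3):
    """Group tabular BLAST hits by query and keep the n best by bit score."""
    hits = defaultdict(list)
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        # Default outfmt 6 columns: qseqid sseqid pident length mismatch
        # gapopen qstart qend sstart send evalue bitscore
        hits[f[0]].append(
            {
                "subject": f[1],
                "pident": float(f[2]),
                "evalue": float(f[10]),
                "bitscore": float(f[11]),
            }
        )
    return {
        q: sorted(h, key=lambda x: x["bitscore"], reverse=True)[:n]
        for q, h in hits.items()
    }


def present_queries(blast_lines, queries, min_pident=40.0, max_evalue=1e-5, n=3):
    """Call a query present if any of its top n hits passes both cutoffs."""
    best = top_hits(blast_lines, n)
    return {
        q: any(
            h["pident"] >= min_pident and h["evalue"] <= max_evalue
            for h in best.get(q, [])
        )
        for q in queries
    }
```

Queries with no hits at all come back as absent, so pass in the full list of query IDs rather than only those that appear in the BLAST output.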
You might want to check the methods of this paper: Long-term phenotypic evolution of bacteria (Nature).
They used flux balance analysis to determine the substrates a bacterial strain can grow on.
Wow, this looks like a really detailed workflow. I will give it a try, but I am not very familiar with the KEGG API. Is there a tutorial I could use to learn it?
Hi dago,
I figured it out by trial and error. Bioconductor's KEGGREST could be a good starting point for exploring the KEGG REST service further. Please also have a look at the HUMAnN workflow to see how they incorporated MinPath. Here are a few more one-liners that you can use on a GenBank file to understand the KEGG service better:
The API uses the KEGG ENZYME database, which is an implementation of the Enzyme Nomenclature (EC numbers) based on the ExplorEnz database, and is maintained in the KEGG LIGAND relational database with additional annotation of reaction hierarchy, organism information, and sequence data links.
To use these one-liners on your GenBank files, replace test.gbk with the name of the file you are using. The first step is to extract a tab-separated list of only those contigs (once you have annotated them with Prokka) that contain enzymes. All other contigs are ignored.
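A rough Python equivalent of this first step (a sketch, not the original one-liner). It assumes each contig starts with a `LOCUS` line and that enzymes are marked with `/EC_number="..."` qualifiers, which is how Prokka-style GenBank output usually looks. The function names are made up for illustration.

```python
import re

# Matches GenBank feature qualifiers such as /EC_number="1.1.1.1"
EC_RE = re.compile(r'/EC_number="([^"]+)"')


def ec_by_contig(gbk_lines):
    """Map contig name -> list of EC numbers found in that contig.

    Contigs without any EC_number qualifier never enter the table,
    so they are ignored automatically.
    """
    table = {}
    contig = None
    for line in gbk_lines:
        if line.startswith("LOCUS"):
            parts = line.split()
            contig = parts[1] if len(parts) > 1 else None
        else:
            match = EC_RE.search(line)
            if match and contig is not None:
                table.setdefault(contig, []).append(match.group(1))
    return table


def to_tsv(table):
    """Render the table as one 'contig<TAB>ec1,ec2,...' line per contig."""
    return "\n".join(f"{contig}\t{','.join(ecs)}" for contig, ecs in table.items())


# Usage idea:
# with open("test.gbk") as handle:
#     print(to_tsv(ec_by_contig(handle)))
```

For anything beyond a quick look, a real GenBank parser such as Biopython's SeqIO is the safer choice, since this regex only sees qualifiers that fit on a single line.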
The following one-liner extracts the list of all enzymes found in the GenBank file and uses the REST-style KEGG API to generate names from EC numbers:
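In Python the same lookup could look roughly like this (again a sketch, not the original command). It assumes the KEGG REST `list` operation at https://rest.kegg.jp accepts `ec:`-prefixed entries joined with `+` and returns one tab-separated "entry, description" line per enzyme. The helper names are mine, and the parser tolerates entries with or without the `ec:` prefix.

```python
KEGG = "https://rest.kegg.jp"


def list_url(ec_numbers):
    """Build a KEGG 'list' URL for a batch of EC numbers."""
    return f"{KEGG}/list/" + "+".join(f"ec:{ec}" for ec in ec_numbers)


def parse_list(text):
    """Turn 'entry<TAB>description' lines into {EC number: description}."""
    names = {}
    for line in text.splitlines():
        if "\t" in line:
            entry, description = line.split("\t", 1)
            # Drop an optional database prefix such as 'ec:'
            names[entry.split(":", 1)[-1]] = description
    return names


# Usage idea (needs network access):
# from urllib.request import urlopen
# text = urlopen(list_url(["1.1.1.1"])).read().decode()
# print(parse_list(text))
```

Keeping the URL building and the parsing separate from the actual download makes it easy to cache responses, which helps when you run this over many genomes.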
For the extracted enzymes, we can list all the KEGG Orthology (KO) groups each enzyme belongs to, along with their detailed descriptions. The KO system is the basis for representing all proteins and functional RNAs that correspond to KEGG pathway nodes, BRITE hierarchy nodes, and KEGG module nodes:
Similarly, for the extracted enzymes, we can also list all the known reactions these enzymes are a part of. Each reaction is identified by the R number and is linked to ortholog groups of enzymes enabling integrated analysis of genomic and chemical information.
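Both of these lookups (enzyme to KO groups, enzyme to reactions) can be sketched in Python with the KEGG REST `link` operation. This assumes `link/ko/...` and `link/reaction/...` return two tab-separated columns, source entry and linked entry. The helper names are hypothetical, and the sample text in the comment is only an illustration of the layout, not real KEGG output.

```python
KEGG = "https://rest.kegg.jp"


def link_url(target_db, ec_numbers):
    """Build a KEGG 'link' URL, e.g. target_db='ko' or target_db='reaction'."""
    return f"{KEGG}/link/{target_db}/" + "+".join(f"ec:{ec}" for ec in ec_numbers)


def parse_link(text):
    """Turn 'source<TAB>target' lines into {source: [targets]}."""
    links = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) == 2:
            links.setdefault(parts[0], []).append(parts[1])
    return links


# Expected layout of a response (illustrative identifiers only):
#   ec:1.1.1.1<TAB>ko:K00001
# Usage idea (needs network access):
# from urllib.request import urlopen
# text = urlopen(link_url("ko", ["1.1.1.1"])).read().decode()
# print(parse_link(text))
```

The KO or R numbers returned this way can then be passed to the `list` operation if you also want their descriptions.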
We may also be interested in knowing which other organisms contain the same enzymes:
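One way to sketch this in Python is to go through the KO groups rather than the EC numbers directly (my substitution, not necessarily what the original one-liner did). It assumes `link/genes/<K number>` returns lines of the form `ko:K...<TAB>org:gene`, where the part before the colon in the second column is the KEGG organism code. Function names are made up.

```python
KEGG = "https://rest.kegg.jp"


def genes_url(ko_id):
    """Build a KEGG 'link' URL listing the genes assigned to one K number."""
    return f"{KEGG}/link/genes/{ko_id}"


def organisms(text):
    """Extract the sorted, unique organism codes from 'ko<TAB>org:gene' lines."""
    codes = set()
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) == 2 and ":" in parts[1]:
            codes.add(parts[1].split(":", 1)[0])
    return sorted(codes)


# Usage idea (needs network access):
# from urllib.request import urlopen
# text = urlopen(genes_url("K00001")).read().decode()
# print(organisms(text))
```

The three- or four-letter codes can be translated to full organism names with the `list/organism` operation.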
Best Wishes,
Umer