You can use MinPath to see if all the enzymes are sufficient for a given Pathway to exist (the other way around). This is what I will do:
If I start with contigs or a genome sequence (can be a file with multiple genomes in it), I will use a tool such as PROKKA to generate a genbank file of my annotation as well as all the extracted CDS regions as a FASTA file with amino acid sequences.
Say I start with the following file:
$ head Sample_1.fasta
>PROT1
AAPAAAAPAAAPAAQAAAPAAAGGDDAILSPAARKLAEEAGIDPNSIAGTGKGGRVTKEDVVAAVEAKKNAPAALAKPAAPAAEAPIFAAGDRVEKRVPMTRLRAKVAERLVEAQSAMAMLTTFNEVNMKPIMDLRSKYKDLFEKKHNGVRLGFMSFFVKAATEALKRFPGVNASIDGNDIVYHGYQDIGVAVSSDRGLVVPVLRNAEFMSLAEIEGGIANFGKKAKEGKLTIEDMTGGTFTISNGGVFGSLLSTPIVNPPQTAILGMHKIQERPMAVNGQVVILPMMYLALSYDHRLIDGKEAVSFLVAIKDLLEDPARLLLDV
>PROT2
AAPAAAGGDDAILSPAARKLAEEAGIDPNSIAGTGKGGRVTKEDVVAAVEAKKNAPAAPAKPAAPAAEAPIFAAGDRVEKRVPMTRLRAKVAERLVEAQSAMAMLTTFNEVNMKPIMDLRSKYKDLFEKKHNGVRLGFMSFFVKAATEALKRFPGVNASIDGNDIVYHGYQDIGVAVSSDRGLVVPVLRNAEFMSLAEIEGGIANFGKKAKEGKLTIEDMTGGTFTISNGGVFGSLLSTPIVNPPQTAILGMHKIQERPMAVNGQVVILPMMYLALSYDHRLIDGKEAVSFLVAIKDLLEDPARLLLDV
>PROT3
AKSKFERNKPHVNIGTIGHVDHGKTSLTAAITKYFGEFKAYDQIDAAPEEKARGITISTAHVEYETPNRHYAHVDCPGHADYVKNMITGAAQMDGAILVVSAADGPMPQTREHILLARQVGVPAIVVFLNKVDQVDDAELLELVELEVRELLSSYEFPGDDIPIVKGSALAALEDSDKKIGEDAIRELMAAVDAYIPTPERPIDQPFLMPIEDVFSISGRGTVVTGRVERGIVKVGEEIEIVGIRPTTKTTCTGVEMFRKLLDQGQAGDNIGALLRGVDRNGVERGQILCKPGSVKPHRKFKAEAYILTKEEGGRHTPFFTNYRPQFYFRTTDVTGIVTLPEGTEMVMPGDNVTVDVELIVPIAMEEKLRFAIREGGRTVGAGIVASIVE
>PROT4
APAAAAPAAAPAAQAAAPAAAGGDDAILSPAARKLAEEAGIDPNSIAGTGKGGRVTKEDVVAAVEAKKNAPAAPAKPAAPAAEAPIFAAGDRVEKRVPMTRLRAKVAERLVEAQSAMAMLTTFNEVNMKPIMDLRSKYKDLFEKKHNGVRLGFMSFFVKAATEALKRFPGVNASIDGNDIVYHGYQDIGVAVSSDRGLVVPVLRNAEFMSLAEIEGGIANFGKKAKEGKLTIEDMTGGTFTISNGGVFGSLLSTPIVNPPQTAILGMHKIQERPMAVNGQVVILPMMYLALSYDHRLIDGKEAVSFLVAIKDLLEDPARLLLDV
>PROT5
ATLLAQAIIREGMKNVTAGANPMVVRKGIQDAVDAAVEALHKNSKKVTCSEDIARVATISAGDETVGKLIAEAMEKVTSDGVITVEESKTSETYSEVVEGMMFDRGYISPYMATDTEKMEAIMDDAYLLITDKKISNIQEILPLLEKIVQSGKKLVIIAEDVEGEALTTLILNKLRGTFNCAAV
Next, we use RAPsearch to search our protein sequences against the KEGG database and get a tab-delimited blast output file (identified by *.out.m8
ending). To speed up RAPsearch, you can specify the number of cores to run on using -z
switch.
$ rapsearch -d /PATH_TO_KEGG_DB/kegg_db -z 2 -q Sample_1.fasta -o Sample_1.out
$ head *.out.m8
# RAPSearch
# Job submitted: Thu Jul 3 17:57:51 2014
# Query : Sample_1.fasta
# Subject : /PATH_TO_KEGG_DB/kegg_db
# Fields: Query Subject identity aln-len mismatch gap-openings q.start q.end s.start s.end log(e-value) bit-score
PROT1 psa:PST_1876 89.2308 325 35 0 1 325 84 408 -156.9 557.37
PROT1 pmy:Pmen_2502 84.7095 327 48 1 1 325 84 410 -149.71 533.49
PROT1 psb:Psyr_2010 78.7692 325 67 1 1 325 89 411 -137.65 493.43
PROT1 hch:HCH_04744 73.2143 336 74 4 2 325 80 411 -130.46 469.54
PROT1 sde:Sde_2105 71.6049 324 89 2 5 325 80 403 -124.55 449.9
We can then identify enzymes in our dataset. For this purpose, we use a REST-style KEGG API to link KEGG entries (2nd column in the output file) to EC numbers:
$ awk -F"\t" '!/#/{print $1","$2}' Sample_1.out.m8 | while IFS=$"," read -r -a mA; do echo -e "${mA[0]}\t$(curl -s http://rest.kegg.jp/link/ec/${mA[1]})"; done | awk '{print $1"\t"$3}' > ec_assignments.txt
$ head ec_assignments.txt
PROT1 ec:2.3.1.61
PROT1 ec:2.3.1.61
PROT1 ec:2.3.1.61
PROT1 ec:2.3.1.61
PROT1 ec:2.3.1.61
PROT1 ec:2.3.1.61
PROT1 ec:2.3.1.61
PROT1 ec:2.3.1.61
PROT1 ec:2.3.1.61
PROT1 ec:2.3.1.61
RAPsearch reports multiple matches for a single protein sequence. We assign the most abundant EC number to each protein sequence:
$ sort ec_assignments.txt | uniq -c | grep -i "ec:" | awk '{gsub("^ec:","",$3);if(ec[$2]){if($1>=c[$2]){ec[$2]=$3;c[$2]=$1}} else {ec[$2]=$3;c[$2]=$1}} END{for (i in ec) print i"\t"ec[i]}' > Sample_1.ec
$ tail Sample_1.ec
PROT1 2.3.1.61
PROT78 4.2.1.11
PROT2 2.3.1.61
PROT136 1.1.1.86
PROT137 1.12.7.2
PROT4 2.3.1.61
PROT138 2.7.9.1
PROT139 3.2.1.23
PROT95 1.1.1.86
PROT9 3.6.3.14
We next construct parsimonious pathways using MinPath as it does not attempt to reconstruct entire pathways from a given set of protein sequences identified in a protemics/genomics experiment, but determines the minimal set of biological pathways that must exist in the biological system to explain the input protein sequences. It takes a set of reference pathways and a set of proteins that can be mapped to one or more pathways, and finds the minimum number of pathways that can explain all the protein functions. MinPath comes with a mapping file ec2path that lists the reference pathways from MetaCyc and all the EC numbers associated with them. MetaCyc serves as an encyclopedia of metabolism containing more than 2151 patways from more than 2500 different organisms.
$ head -50 ec2path
#Metacyc pathway and ec mapping file
#Pathway EC
PWY0-1479 3.1.26.12
PWY0-1479 2.7.7.56
PWY0-1479 3.1.13.5
PWY0-1479 3.1.26.5
PWY-6146 4.2.1.11
PWY-6146 5.4.2.1
PWY-6146 1.2.7.6
PWY-6146 3.5.4.9
PWY-6146 6.4.1.1
PWY-6146 1.2.7.1
PWY-6146 2.6.1.1
PWY-6146 2.6.1.2
PWY-6146 1.1.1.261
...
To construct pathways from the generated EC file using MetaCyc pathways (using default mapping file in MinPath), run it as follows:
$ MinPath1.2.py -any Sample_1.ec -map ec2path -report Sample_1.ec.report -details Sample_1.ec.details
$ head Sample_1.ec.report
path 2 any n/a naive 1 minpath 0 fam0 39 fam-found 2 name PWY-6146
path 4 any n/a naive 1 minpath 1 fam0 8 fam-found 1 name BRANCHED-CHAIN-AA-SYN-PWY
path 11 any n/a naive 1 minpath 1 fam0 9 fam-found 1 name PWY-821
path 44 any n/a naive 1 minpath 1 fam0 1 fam-found 1 name GLUTSYNIII-PWY
path 46 any n/a naive 1 minpath 0 fam0 6 fam-found 1 name PWY-5505
path 49 any n/a naive 1 minpath 1 fam0 8 fam-found 1 name PWY-6549
path 53 any n/a naive 1 minpath 0 fam0 13 fam-found 1 name THREOCAT-PWY
path 55 any n/a naive 1 minpath 0 fam0 5 fam-found 1 name ILEUSYN-PWY
path 56 any n/a naive 1 minpath 0 fam0 11 fam-found 1 name PWY-3001
path 57 any n/a naive 1 minpath 0 fam0 6 fam-found 1 name PWY-5101
The last column reports the pathways, which you can search on MetaCyc website to get the detailed information. For example, here is the record for BRANCHED-CHAIN-AA-SYN-PWY on the website:
http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/Metaproteomics/metacyc.png" style="height:410px; width:600px" />However, we are interested in KEGG pathways so that we can draw metabolic networks using iPath. We, therefore, generate a new mapping file using the REST-style KEGG API:
$ for I in $(curl -S http://rest.kegg.jp/list/pathway | awk -F"\t" '{print $1}'); do curl -S http://rest.kegg.jp/link/ec/$i; done > ec2kegg
$ awk -F"\t" '!/^$/{gsub("^path:","",$1);gsub("^ec:","",$2);print $1"\t"$2}' ec2kegg > ec2kegg_final
$ chmod +x ec2kegg_final
$ head -50 ec2kegg_final
map00010 1.1.1.1
map00010 1.1.1.2
map00010 1.1.1.27
map00010 1.1.2.7
map00010 1.1.2.8
map00010 1.2.1.12
map00010 1.2.1.3
map00010 1.2.1.5
map00010 1.2.1.59
map00010 1.2.1.9
...
Now run MinPath1.2.py
with the new mapping file using -map ec2kegg_final
in the command:
$ MinPath1.2.py -any Sample_1.ec -map ec2kegg_final -report Sample_1.ec.report.kegg -details Sample_1.ec.details.kegg
$ head Sample_1.ec.report.kegg
path 1 any n/a naive 1 minpath 0 fam0 45 fam-found 3 name map00010
path 2 any n/a naive 1 minpath 0 fam0 24 fam-found 1 name map00020
path 3 any n/a naive 1 minpath 0 fam0 50 fam-found 1 name map00030
path 4 any n/a naive 1 minpath 0 fam0 63 fam-found 1 name map00040
path 5 any n/a naive 1 minpath 0 fam0 70 fam-found 2 name map00051
path 6 any n/a naive 1 minpath 0 fam0 46 fam-found 1 name map00052
path 9 any n/a naive 1 minpath 0 fam0 13 fam-found 1 name map00062
path 10 any n/a naive 1 minpath 0 fam0 30 fam-found 2 name map00071
path 18 any n/a naive 1 minpath 0 fam0 11 fam-found 2 name map00190
path 19 any n/a naive 1 minpath 0 fam0 3 fam-found 1 name map00195
To get the description of the reported KEGG pathways, use the following one-liner:
$ for I in $(awk '{print $14}' Sample_1.ec.report.kegg); do curl -S http://rest.kegg.jp/find/pathway/$i; done
path:map00010 Glycolysis / Gluconeogenesis
path:map00020 Citrate cycle (TCA cycle)
path:map00030 Pentose phosphate pathway
path:map00040 Pentose and glucuronate interconversions
path:map00051 Fructose and mannose metabolism
path:map00052 Galactose metabolism
path:map00062 Fatty acid elongation
path:map00071 Fatty acid degradation
path:map00190 Oxidative phosphorylation
path:map00195 Photosynthesis
path:map00230 Purine metabolism
path:map00240 Pyrimidine metabolism
path:map00250 Alanine, aspartate and glutamate metabolism
path:map00280 Valine, leucine and isoleucine degradation
path:map00281 Geraniol degradation
path:map00290 Valine, leucine and isoleucine biosynthesis
path:map00310 Lysine degradation
path:map00330 Arginine and proline metabolism
path:map00360 Phenylalanine metabolism
path:map00362 Benzoate degradation
path:map00380 Tryptophan metabolism
path:map00410 beta-Alanine metabolism
path:map00511 Other glycan degradation
path:map00531 Glycosaminoglycan degradation
path:map00592 alpha-Linolenic acid metabolism
path:map00600 Sphingolipid metabolism
path:map00604 Glycosphingolipid biosynthesis - ganglio series
path:map00620 Pyruvate metabolism
path:map01120 Microbial metabolism in diverse environments
...
We want to use interactive Pathways Explorer (iPath), which is a web-based tool for the visualization, analysis and customization of the various pathways maps. The metabolic pathways are constructed using 146 KEGG pathways, and give an overview of the complete metabolism in biological systems. We first extract the pathways in the format to be used with iPath and give each of them red color:
$ awk '{gsub("^map","",$14);print $14" #ff0000"}' Sample_1.ec.report.kegg > Sample_1.ec.report.kegg.ipath
Now go to http://pathways.embl.de/index.html and click on iPath v2 interface to load the applet:
http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/Metaproteomics/ipath.png" style="height:313px; width:600px" />Next click on the "Customize" button to load the "Customization and data mapping" tab, and upload Sample_1.ec.report.kegg.ipath
by clicking on "Select file" button. This will populate the "Element selection" text area. You can manually assign different colors to your pathways here. Finally, click on "Submit data and customize maps" button to generate your metabolic network.
Once generated, you can save the metabolic network as a PDF document to your local computer. The connections in red are the ones present in your sample:
http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/Metaproteomics/Sample_1.png" style="height:366px; width:600px" />Best Wishes,
Umer
Wow this looks like a really detailed work flow. I will give it a try but I am a bit dumb with the KEGG API. Any tutorial I could use to learn it?
Hi dago,
I figured it out by hit and trial method. Bioconductor's KEGGREST could be a good starting point to explore KEGG REST service further. Please also have a look at the HUManN workflow on how they incorporated MinPath. Here are a few more one-liners that you can use on a GENBANK file to understand the KEGG service better:
The API uses KEGG ENZYME database, which is an implementation of the Enzyme Nomenclature (EC Numbers) on the ExplorEnz database, and is maintained in the KEGG LIGAND relational database with additional annotation of reaction hierarchy, organism information, and sequence data links.
To use these one-liners on your GENBANK files, replace test.gbk with the name of the file you are using. First step is to extract a tab-separated list of only those contigs (once you have annotated them through PROKKA) which have enzymes in them. All other contigs are ignored
The following one-liner extracts the list of all enzymes found in the GENBANK file and uses rest-style KEGG API to generate names from EC numbers:
For the extracted enzymes, we can list all the KEGG Ortholog (KO) groups each enzyme is part of, along with their detailed description. The KO system is the basis for representation for all proteins and functional RNAs that correspond to KEGG pathway nodes, BRITE hierarchy nodes, and KEGG module nodes:
Similarly, for the extracted enzymes, we can also list all the known reactions these enzymes are a part of. Each reaction is identified by the R number and is linked to ortholog groups of enzymes enabling integrated analysis of genomic and chemical information.
We may also be interested in knowing which other organisms contain the same enzymes:
Best Wishes,
Umer