Hi Phil,
I might not have definite answer but I can point you in the right direction. You can actually use rest-style KEGG API. The API uses KEGG ENZYME database, which is an implementation of the Enzyme Nomenclature (EC Numbers) on the ExplorEnz database, and is maintained in the KEGG LIGAND relational database with additional annotation of reaction hierarchy, organism information, and sequence data links.
HOWEVER, I'll start with an annotated genbank file that you may create using an annotation tool such as PROKKA on your metagenomic contigs. Moreover, I'll start off with enzymes, rather than species and families (Probably you can construct your queries in a similar manner by exploring those features that I have not done so and by writing similar one-liners).
If I want to extract a tab-separated list of only those contigs which have enzymes in them (all other contigs are ignored):
$ awk '/EC_number/{print gensub(" +/EC_number=\"(.*)\"","ec:\\1","g")}/^LOCUS/{print ";"$2}' test.gbk | tr '\n' '\t' | tr ';' '\n' | awk 'NF>1{print $0}'
NODE_2925_length_923_cov_1.698808 ec:4.2.1.17
NODE_4520_length_55408_cov_7.328635 ec:1.11.1.6 ec:1.2.1.22 ec:1.1.99.1 ec:2.5.1.18
NODE_6043_length_823_cov_1.500607 ec:6.2.1.25
NODE_918_length_270_cov_10.744445 ec:1.5.3.1
NODE_312_length_2459_cov_13.763318 ec:4.3.1.17
NODE_282_length_4935_cov_13.040932 ec:2.7.6.5 ec:3.1.7.2 ec:2.7.7.6 ec:2.7.4.8
NODE_3153_length_49002_cov_8.110995 ec:1.4.3.19 ec:4.1.3.39 ec:5.1.1.8 ec:3.5.4.22 ec:1.2.1.26 ec:1.1.1.46 ec:1.1.1.1 ec:4.1.1.28
NODE_241_length_4648_cov_13.109509 ec:4.2.1.20
NODE_533_length_2973_cov_13.546249 ec:4.2.1.20
NODE_2986_length_180_cov_3.444444 ec:1.2.1.38
NODE_2957_length_228_cov_5.881579 ec:4.2.1.52
NODE_2313_length_214_cov_3.420561 ec:1.14.11.17
NODE_2177_length_8710_cov_10.954764 ec:4.1.1.36 ec:6.3.2.5 ec:3.6.1.23 ec:5.4.2.8 ec:5.4.2.2 ec:2.7.2.8 ec:2.4.2.10
NODE_2125_length_194_cov_3.505155 ec:1.2.1.27
NODE_195_length_3240_cov_14.628086 ec:4.2.1.75 ec:4.2.1.75
NODE_395_length_847_cov_14.246754 ec:2.1.1.45
NODE_263_length_1894_cov_12.784055 ec:1.8.4.11
NODE_2068_length_256_cov_5.703125 ec:6.3.2.10
NODE_5094_length_82060_cov_7.647209 ec:5.2.1.8 ec:3.4.21.92 ec:2.5.1.54 ec:3.1.1.3 ec:5.2.1.8 ec:1.1.1.30
NODE_1254_length_19746_cov_10.331156 ec:1.6.5.5 ec:1.5.1.29
NODE_4044_length_223_cov_4.789237 ec:1.1.99.1
NODE_2933_length_5310_cov_10.508098 ec:3.4.13.19
NODE_2296_length_260_cov_4.046154 ec:2.7.1.24
NODE_618_length_171_cov_3.111111 ec:2.3.1.31
NODE_4435_length_1843_cov_2.593597 ec:6.1.1.9
NODE_1478_length_868_cov_12.869816 ec:2.4.2.17
NODE_518_length_1838_cov_14.368879 ec:2.1.1.176 ec:2.1.2.9
NODE_4252_length_45804_cov_8.195944 ec:1.11.1.6 ec:1.11.1.7 ec:2.7.1.23 ec:6.2.1.3 ec:4.1.1.18 ec:2.7.7.7 ec:3.1.26.4
If I want to extract the list of all enzymes found in the genbank file and uses rest-style KEGG API to generate names from EC numbers
$ for i in $(awk '/EC_number/{print gensub(" +/EC_number=\"(.*)\"","\\1","g")}' test.gbk | sort | uniq); do echo $(curl -s http://rest.kegg.jp/find/enzyme/$i) | awk '{$1=$1"\t"}1'; done
ec:1.1.1.108 carnitine 3-dehydrogenase
ec:1.11.1.15 peroxiredoxin; thioredoxin peroxidase; tryparedoxin peroxidase; alkyl hydroperoxide reductase C22; AhpC; TrxPx; TXNPx; Prx; PRDX
ec:1.1.1.158 Transferred to 1.3.1.98
ec:1.11.1.6 catalase; equilase; caperase; optidase; catalase-peroxidase; CAT
ec:1.1.1.169 2-dehydropantoate 2-reductase; 2-oxopantoate reductase; 2-ketopantoate reductase; 2-ketopantoic acid reductase; ketopantoate reductase; ketopantoic acid reductase
ec:1.11.1.7 peroxidase; lactoperoxidase; guaiacol peroxidase; plant peroxidase; Japanese radish peroxidase; horseradish peroxidase (HRP); soybean peroxidase (SBP); extensin peroxidase; heme peroxidase; oxyperoxidase; protoheme peroxidase; pyrocatechol peroxidase; scopoletin peroxidase; Coprinus cinereus peroxidase; Arthromyces ramosus peroxidase
ec:1.11.1.9 glutathione peroxidase; GSH peroxidase; selenium-glutathione peroxidase; reduced glutathione peroxidase
ec:1.1.1.205 IMP dehydrogenase; inosine-5'-phosphate dehydrogenase; inosinic acid dehydrogenase; inosinate dehydrogenase; inosine 5'-monophosphate dehydrogenase; inosine monophosphate dehydrogenase; IMP oxidoreductase; inosine monophosphate oxidoreductase ec:1.2.1.14 Transferred to 1.1.1.205
ec:1.1.1.215 gluconate 2-dehydrogenase; 2-keto-D-gluconate reductase; 2-ketogluconate reductase
For the extracted enzymes, we can list all the KEGG Ortholog (KO) groups each enzyme is part of, along with their detailed description. The KO system is the basis for representation for all proteins and functional RNAs that correspond to KEGG pathway nodes, BRITE hierarchy nodes, and KEGG module nodes:
$ for i in $(awk '/EC_number/{print gensub(" +/EC_number=\"(.*)\"","\\1","g")}' test.gbk | sort | uniq); do echo $(curl -s http://rest.kegg.jp/link/ko/ec:$i | grep -Po '(?<=ko:).*' | sed -e "s/K/ko:K/g") | xargs -n 1 | xargs -I {} curl -s http://rest.kegg.jp/find/ko/{} | awk -v k=$i '{print "ec:"k"\t"$0}' ; done
ec:1.10.2.2 ko:K00411 UQCRFS1, RIP1, petA; ubiquinol-cytochrome c reductase iron-sulfur subunit [EC:1.10.2.2]
ec:1.1.1.1 ko:K00001 E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]
ec:1.1.1.1 ko:K00121 frmA, ADH5, adhC; S-(hydroxymethyl)glutathione dehydrogenase / alcohol dehydrogenase [EC:1.1.1.284 1.1.1.1]
ec:1.1.1.1 ko:K04072 adhE; acetaldehyde dehydrogenase / alcohol dehydrogenase [EC:1.2.1.10 1.1.1.1]
ec:1.1.1.1 ko:K11440 gbsB; choline dehydrogenase [EC:1.1.1.1]
ec:1.1.1.1 ko:K13951 ADH1_7; alcohol dehydrogenase 1/7 [EC:1.1.1.1]
ec:1.1.1.1 ko:K13952 ADH6; alcohol dehydrogenase 6 [EC:1.1.1.1]
ec:1.1.1.1 ko:K13953 adhP; alcohol dehydrogenase, propanol-preferring [EC:1.1.1.1]
ec:1.1.1.1 ko:K13954 yiaY; alcohol dehydrogenase [EC:1.1.1.1]
ec:1.1.1.1 ko:K13980 ADH4; alcohol dehydrogenase 4 [EC:1.1.1.1]
ec:1.1.1.100 ko:K00059 fabG; 3-oxoacyl-[acyl-carrier protein] reductase [EC:1.1.1.100]
ec:1.1.1.100 ko:K11610 mabA; beta-ketoacyl ACP reductase [EC:1.1.1.100]
ec:1.1.1.108 ko:K17735 lcdH, cdhA; carnitine 3-dehydrogenase [EC:1.1.1.108]
ec:1.11.1.15 ko:K03386 E1.11.1.15, PRDX, ahpC; peroxiredoxin (alkyl hydroperoxide reductase subunit C) [EC:1.11.1.15]
ec:1.11.1.15 ko:K03564 BCP, PRXQ, DOT5; peroxiredoxin Q/BCP [EC:1.11.1.15]
ec:1.11.1.15 ko:K11065 tpx; thiol peroxidase, atypical 2-Cys peroxiredoxin [EC:1.11.1.15]
ec:1.11.1.15 ko:K11185 TRYP; cytosolic tryparedoxin peroxidase, trypanosomatid typical 2-Cys peroxiredoxin [EC:1.11.1.15]
ec:1.11.1.15 ko:K11186 MTRYP; mitochondrial tryparedoxin peroxidase, trypanosomatid typical 2-Cys peroxiredoxin [EC:1.11.1.15]
ec:1.11.1.15 ko:K11187 PRDX5; peroxiredoxin 5, atypical 2-Cys peroxiredoxin [EC:1.11.1.15]
ec:1.11.1.15 ko:K11188 PRDX6; peroxiredoxin 6, 1-Cys peroxiredoxin [EC:1.11.1.7 1.11.1.15 3.1.1.-]
ec:1.11.1.15 ko:K13279 PRDX1; peroxiredoxin 1 [EC:1.11.1.15]
ec:1.11.1.15 ko:K14171 AHP1; alkyl hydroperoxide reductase 1 [EC:1.11.1.15]
ec:1.1.1.132 ko:K00066 algD; GDP-mannose 6-dehydrogenase [EC:1.1.1.132]
ec:1.1.1.133 ko:K00067 rfbD, rmlD; dTDP-4-dehydrorhamnose reductase [EC:1.1.1.133]
ec:1.11.1.5 ko:K00428 E1.11.1.5; cytochrome c peroxidase [EC:1.11.1.5]
Similarly, for the extracted enzymes, we can also list all the known reactions these enzymes are a part of. Each reaction is identified by the R number and is linked to ortholog groups of enzymes enabling integrated analysis of genomic and chemical information.
$ for i in $(awk '/EC_number/{print gensub(" +/EC_number=\"(.*)\"","\\1","g")}' test.gbk | sort | uniq); do echo $(curl -s http://rest.kegg.jp/link/reaction/ec:$i | grep -Po '(?<=rn:).*') | xargs -n 1 | xargs -I {} curl -s http://rest.kegg.jp/find/reaction/rn:{} | awk -v k=$i '{print "ec:"k"\t"$0}' ; done
ec:1.10.2.2 rn:R02161 Ubiquinol:ferricytochrome-c oxidoreductase; Ubiquinol + 2 Ferricytochrome c <=> Ubiquinone + 2 Ferrocytochrome c + 2 H+
ec:1.1.1.1 rn:R00228 acetaldehyde:NAD+ oxidoreductase (CoA-acetylating); Acetaldehyde + CoA + NAD+ <=> Acetyl-CoA + NADH + H+
ec:1.1.1.1 rn:R00623 primary_alcohol:NAD+ oxidoreductase; Primary alcohol + NAD+ <=> Aldehyde + NADH + H+
ec:1.1.1.1 rn:R00624 Secondary_alcohol:NAD+ oxidoreductase; Secondary alcohol + NAD+ <=> Ketone + NADH + H+
ec:1.1.1.1 rn:R00754 ethanol:NAD+ oxidoreductase; Ethanol + NAD+ <=> Acetaldehyde + NADH + H+
ec:1.1.1.1 rn:R01172 butanal:NAD+ oxidoreductase (CoA-acylating); Butanal + CoA + NAD+ <=> Butanoyl-CoA + NADH + H+
ec:1.1.1.1 rn:R02124 retinol:NAD+ oxidoreductase; Retinol + NAD+ <=> Retinal + NADH + H+
ec:1.1.1.1 rn:R02878 1-Octanol:NAD+ oxidoreductase; 1-Octanol + NAD+ <=> 1-Octanal + NADH + H+
ec:1.1.1.1 rn:R04805 3alpha,7alpha,26-Trihydroxy-5beta-cholestane + NAD+ <=> 3alpha,7alpha-Dihydroxy-5beta-cholestan-26-al + NADH + H+
ec:1.1.1.1 rn:R04880 3,4-dihydroxyphenylethyleneglycol:NAD+ oxidoreductase; 3,4-Dihydroxyphenylethyleneglycol + NAD+ <=> 3,4-Dihydroxymandelaldehyde + NADH + H+
ec:1.1.1.1 rn:R05233 trans-3-Chloro-2-propene-1-ol:NAD+ oxidoreductase; trans-3-Chloro-2-propene-1-ol + NAD+ <=> trans-3-Chloroallyl aldehyde + NADH + H+
ec:1.1.1.1 rn:R05234 cis-3-chloro-2-propene-1-ol:NAD+ oxidoreductase; cis-3-Chloro-2-propene-1-ol + NAD+ <=> cis-3-Chloroallyl aldehyde + NADH + H+
ec:1.1.1.1 rn:R06917 1-hydroxymethylnaphthalene:NAD+ oxidoreductase; 1-Hydroxymethylnaphthalene + NAD+ <=> 1-Naphthaldehyde + NADH + H+
ec:1.1.1.1 rn:R06927 (2-naphthyl)methanol:NAD+ oxidoreductase; (2-Naphthyl)methanol + NAD+ <=> 2-Naphthaldehyde + NADH + H+
ec:1.1.1.1 rn:R06983 S-(hydroxymethyl)glutathione dehydrogenase; S-(Hydroxymethyl)glutathione + NAD+ <=> S-Formylglutathione + NADH + H+
ec:1.1.1.1 rn:R07105 trichloroethanol:NAD+ oxidoreductase; Chloral hydrate + NADH + H+ <=> Trichloroethanol + NAD+ + H2O
ec:1.1.1.1 rn:R07326 alcohol:NAD+ oxidoreductase; Alcohol + NAD+ <=> Aldehyde + NADH + H+
ec:1.1.1.1 rn:R07327 alcohol:NAD+ oxidoreductase; Alcohol + NAD+ <=> Ketone + NADH + H+
ec:1.1.1.1 rn:R08281 alcophosphamide:NAD+ oxidoreductase; Aldophosphamide + NADH + H+ <=> Alcophosphamide + NAD+
ec:1.1.1.1 rn:R08306 2-phenyl-1,3-propanediol monocarbamate:NAD+ oxidoreductase; 2-Phenyl-1,3-propanediol monocarbamate + NAD+ <=> 3-Carbamoyl-2-phenylpropionaldehyde + NADH + H+
ec:1.1.1.1 rn:R08310 4-hydroxy-5-phenyltetrahydro-1,3-oxazin-2-one:NAD+ oxidoreductase; 4-Hydroxy-5-phenyltetrahydro-1,3-oxazin-2-one + NAD+ <=> 5-Phenyl-1,3-oxazinane-2,4-dione + NADH + H+
ec:1.1.1.1 rn:R08557 Choline + NAD+ <=> Betaine aldehyde + NADH + H+
We may also be interested in knowing which other organisms contain the same enzymes:
$ for i in $(awk '/EC_number/{print gensub(" +/EC_number=\"(.*)\"","\\1","g")}' test.gbk | sort | uniq); do echo $(curl -s http://rest.kegg.jp/link/genome/ec:$i | grep -Po '(?<=genome:).*') | xargs -n 1 | xargs -I {} curl -s http://rest.kegg.jp/find/genome/genome:{} | awk -v k=$i '{print "ec:"k"\t"$0}' ; done
ec:1.10.2.2 genome:T01003 rno, RAT, 10116; Rattus norvegicus (Norway rat)
ec:1.10.2.2 genome:T01008 bta, BOVIN, 9913; Bos taurus (cow)
ec:1.1.1.1 genome:T00030 dme, DROME, 7227; Drosophila melanogaster (fruit fly)
ec:1.1.1.1 genome:T01001 hsa, HUMAN, 9606; Homo sapiens (human)
ec:1.1.1.1 genome:T01003 rno, RAT, 10116; Rattus norvegicus (Norway rat)
ec:1.1.1.1 genome:T01058 ecb, HORSE, 9796; Equus caballus (horse)
ec:1.1.1.100 genome:T00007 eco, ECOLI, 511145; Escherichia coli K-12 MG1655
ec:1.1.1.108 genome:T00035 pae, PSEAE, 208964; Pseudomonas aeruginosa PAO1
ec:1.11.1.15 genome:T01001 hsa, HUMAN, 9606; Homo sapiens (human)
ec:1.11.1.15 genome:T01003 rno, RAT, 10116; Rattus norvegicus (Norway rat)
ec:1.11.1.5 genome:T00005 sce, YEAST, 559292; Saccharomyces cerevisiae S288c
ec:1.1.1.157 genome:T00564 ckl, CLOK5, 431943; Clostridium kluyveri DSM 555
ec:1.11.1.6 genome:T00115 lpl, LACPL, 220668; Lactobacillus plantarum WCFS1
ec:1.11.1.6 genome:T01001 hsa, HUMAN, 9606; Homo sapiens (human)
ec:1.1.1.169 genome:T00005 sce, YEAST, 559292; Saccharomyces cerevisiae S288c
ec:1.1.1.169 genome:T00007 eco, ECOLI, 511145; Escherichia coli K-12 MG1655
ec:1.11.1.7 genome:T01008 bta, BOVIN, 9913; Bos taurus (cow)
ec:1.11.1.9 genome:T01008 bta, BOVIN, 9913; Bos taurus (cow)
ec:1.1.1.205 genome:T00566 kpn, KLEP7, 272620; Klebsiella pneumoniae subsp. pneumoniae MGH 78578
Best Wishes,
Umer
Thanks, will have a look at it!
This definitely looks like a great solution, will try it. Thanks!
Ok so, after looking into it it really seems like the way I have to go... ;) Thanks for that, the problem is that I don't have any contigs or something else I just have the species name, that's all. Any Ideas how to handle that?
Well you can always download the annotated GBK file from NCBI for different species. Put them all in a folder, use
cat *.gbk > test.gbk
and then keeping your fingers crossed that you have annotated enzymes, you can follow the one-liners then onwards.Read this post of mine.
I guess this will be my work around if anything else fails. What I am thinking about at the moment is using the KEGGREST R API from Bioconductor. The only problem I got at the moment is that I'm not too sure about how to get the first list of pathways available at all. Because if I got those, I can just 'grep' for the T.genome numbers, download those and do what ever I want to them... This, might decrease traffic and therefore time...
I will keep you posted anyways what worked out best! If you have any other Idea let me know! But thanks for your help!!!!!!
Okay I spent last 20 mins thinking over this, and here is the solution ( I have to go somewhere now, I did it in a hurry so excuse a long one-liner, but I think I have done it right):
Step 1: Go to the website, and see which organisms/species are you interested in, and then get their T numbers (which is genome ID), and store them in
IDs.txt
Step 2: Now use the following to extract the pathways for your species/organism:
thank you so much for investing that much into it! I really appreciate it!!!!!!
hey,
apologize distrubing you again but why I am not able to redirect the output into a file instead having it on the terminal? Is there any special way needed to do that?
I was trying it with:
or
but somehow this just creates the file but leaves it empty... (both times)
edit:
solved it. THANK YOU SO MUCH!!