Question

Download Pathways associated with species

1

Entering edit mode

10.7 years ago

Phil S. ▴ 700

Hi guys,

I got a metagenome sample and identified several species/families... in there. Now I'm thinking of automatically gathering pathways (just the names though) for each of those species. Preferably this should be done in R. Any Ideas of a Database / API where to download Pathway-Names given the species name?

Thanks,
Phil

to clarify, basically what I need is typing 'Gardnerella Vaginalis' into some KEGG (or any other) API and I retrieve the list of pathways, here just a snapshot

gvg00010             Glycolysis / Gluconeogenesis - Gardnerella vaginalis ATCC 14019 
gvg00030             Pentose phosphate pathway - Gardnerella vaginalis ATCC 14019 
gvg00040             Pentose and glucuronate interconversions - Gardnerella vaginalis ATCC 14019 
gvg00051             Fructose and mannose metabolism - Gardnerella vaginalis ATCC 14019 
gvg00052             Galactose metabolism - Gardnerella vaginalis ATCC 14019 
gvg00061             Fatty acid biosynthesis - Gardnerella vaginalis ATCC 14019 
gvg00071             Fatty acid degradation - Gardnerella vaginalis ATCC 14019 
gvg00072             Synthesis and degradation of ketone bodies - Gardnerella vaginalis ATCC 14019 
gvg00121             Secondary bile acid biosynthesis - Gardnerella vaginalis ATCC 14019

pathways metagenomes crawling • 5.1k views

ADD COMMENT • link updated 3.3 years ago by Ram 44k • written 10.7 years ago by Phil S. ▴ 700

0

Entering edit mode

10.7 years ago

Pierre Lindenbaum 165k

The NCBI biosystems database contains a resource mapping the BSID to the taxon iD:

ftp://ftp.ncbi.nih.gov/pub/biosystems/CURRENT/biosystems_taxonomy.gz

ADD COMMENT • link updated 5.1 years ago by Ram 44k • written 10.7 years ago by Pierre Lindenbaum 165k

0

Entering edit mode

This does not give me the pathways, does it?

ADD REPLY • link 10.7 years ago by Phil S. ▴ 700

0

Entering edit mode

10.7 years ago

Prakki Rama ★ 2.7k

Using the KEGG REST, wrote the following PERL script. Is this what you exactly wanted?

open FH,"pathwayIds.txt";  ##list of organisms in KEGG Org code
while(<FH>)
{
`wget http://rest.kegg.jp/list/pathway/$_`;
}
close(FH);

INPUT: pathwayIds.txt

gva
gvg
gvh

If you just have one organism, then type the following in terminal

wget http://rest.kegg.jp/list/pathway/gvg

ADD COMMENT • link updated 5.1 years ago by Ram 44k • written 10.7 years ago by Prakki Rama ★ 2.7k

0

Entering edit mode

Thank you for adding the answers!

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 10.7 years ago by Phil S. ▴ 700

Ram · Accepted Answer · 2014-04-25

Hi Phil,

I might not have definite answer but I can point you in the right direction. You can actually use rest-style KEGG API. The API uses KEGG ENZYME database, which is an implementation of the Enzyme Nomenclature (EC Numbers) on the ExplorEnz database, and is maintained in the KEGG LIGAND relational database with additional annotation of reaction hierarchy, organism information, and sequence data links.

HOWEVER, I'll start with an annotated genbank file that you may create using an annotation tool such as PROKKA on your metagenomic contigs. Moreover, I'll start off with enzymes, rather than species and families (Probably you can construct your queries in a similar manner by exploring those features that I have not done so and by writing similar one-liners).

If I want to extract a tab-separated list of only those contigs which have enzymes in them (all other contigs are ignored):

$ awk '/EC_number/{print gensub(" +/EC_number=\"(.*)\"","ec:\\1","g")}/^LOCUS/{print ";"$2}'  test.gbk | tr '\n' '\t' | tr ';' '\n' | awk 'NF>1{print $0}'
NODE_2925_length_923_cov_1.698808	ec:4.2.1.17	
NODE_4520_length_55408_cov_7.328635	ec:1.11.1.6	ec:1.2.1.22	ec:1.1.99.1	ec:2.5.1.18	
NODE_6043_length_823_cov_1.500607	ec:6.2.1.25	
NODE_918_length_270_cov_10.744445	ec:1.5.3.1	
NODE_312_length_2459_cov_13.763318	ec:4.3.1.17	
NODE_282_length_4935_cov_13.040932	ec:2.7.6.5	ec:3.1.7.2	ec:2.7.7.6	ec:2.7.4.8	
NODE_3153_length_49002_cov_8.110995	ec:1.4.3.19	ec:4.1.3.39	ec:5.1.1.8	ec:3.5.4.22	ec:1.2.1.26	ec:1.1.1.46	ec:1.1.1.1	ec:4.1.1.28	
NODE_241_length_4648_cov_13.109509	ec:4.2.1.20	
NODE_533_length_2973_cov_13.546249	ec:4.2.1.20	
NODE_2986_length_180_cov_3.444444	ec:1.2.1.38	
NODE_2957_length_228_cov_5.881579	ec:4.2.1.52	
NODE_2313_length_214_cov_3.420561	ec:1.14.11.17	
NODE_2177_length_8710_cov_10.954764	ec:4.1.1.36	ec:6.3.2.5	ec:3.6.1.23	ec:5.4.2.8	ec:5.4.2.2	ec:2.7.2.8	ec:2.4.2.10	
NODE_2125_length_194_cov_3.505155	ec:1.2.1.27	
NODE_195_length_3240_cov_14.628086	ec:4.2.1.75	ec:4.2.1.75	
NODE_395_length_847_cov_14.246754	ec:2.1.1.45	
NODE_263_length_1894_cov_12.784055	ec:1.8.4.11	
NODE_2068_length_256_cov_5.703125	ec:6.3.2.10	
NODE_5094_length_82060_cov_7.647209	ec:5.2.1.8	ec:3.4.21.92	ec:2.5.1.54	ec:3.1.1.3	ec:5.2.1.8	ec:1.1.1.30	
NODE_1254_length_19746_cov_10.331156	ec:1.6.5.5	ec:1.5.1.29	
NODE_4044_length_223_cov_4.789237	ec:1.1.99.1	
NODE_2933_length_5310_cov_10.508098	ec:3.4.13.19	
NODE_2296_length_260_cov_4.046154	ec:2.7.1.24	
NODE_618_length_171_cov_3.111111	ec:2.3.1.31	
NODE_4435_length_1843_cov_2.593597	ec:6.1.1.9	
NODE_1478_length_868_cov_12.869816	ec:2.4.2.17	
NODE_518_length_1838_cov_14.368879	ec:2.1.1.176	ec:2.1.2.9	
NODE_4252_length_45804_cov_8.195944	ec:1.11.1.6	ec:1.11.1.7	ec:2.7.1.23	ec:6.2.1.3	ec:4.1.1.18	ec:2.7.7.7	ec:3.1.26.4

If I want to extract the list of all enzymes found in the genbank file and uses rest-style KEGG API to generate names from EC numbers

$ for i in $(awk '/EC_number/{print gensub(" +/EC_number=\"(.*)\"","\\1","g")}'  test.gbk | sort | uniq); do echo $(curl -s http://rest.kegg.jp/find/enzyme/$i) | awk '{$1=$1"\t"}1'; done
ec:1.1.1.108	 carnitine 3-dehydrogenase
ec:1.11.1.15	 peroxiredoxin; thioredoxin peroxidase; tryparedoxin peroxidase; alkyl hydroperoxide reductase C22; AhpC; TrxPx; TXNPx; Prx; PRDX
ec:1.1.1.158	 Transferred to 1.3.1.98
ec:1.11.1.6	 catalase; equilase; caperase; optidase; catalase-peroxidase; CAT
ec:1.1.1.169	 2-dehydropantoate 2-reductase; 2-oxopantoate reductase; 2-ketopantoate reductase; 2-ketopantoic acid reductase; ketopantoate reductase; ketopantoic acid reductase
ec:1.11.1.7	 peroxidase; lactoperoxidase; guaiacol peroxidase; plant peroxidase; Japanese radish peroxidase; horseradish peroxidase (HRP); soybean peroxidase (SBP); extensin peroxidase; heme peroxidase; oxyperoxidase; protoheme peroxidase; pyrocatechol peroxidase; scopoletin peroxidase; Coprinus cinereus peroxidase; Arthromyces ramosus peroxidase
ec:1.11.1.9	 glutathione peroxidase; GSH peroxidase; selenium-glutathione peroxidase; reduced glutathione peroxidase
ec:1.1.1.205	 IMP dehydrogenase; inosine-5'-phosphate dehydrogenase; inosinic acid dehydrogenase; inosinate dehydrogenase; inosine 5'-monophosphate dehydrogenase; inosine monophosphate dehydrogenase; IMP oxidoreductase; inosine monophosphate oxidoreductase ec:1.2.1.14 Transferred to 1.1.1.205
ec:1.1.1.215	 gluconate 2-dehydrogenase; 2-keto-D-gluconate reductase; 2-ketogluconate reductase

For the extracted enzymes, we can list all the KEGG Ortholog (KO) groups each enzyme is part of, along with their detailed description. The KO system is the basis for representation for all proteins and functional RNAs that correspond to KEGG pathway nodes, BRITE hierarchy nodes, and KEGG module nodes:

$ for i in $(awk '/EC_number/{print gensub(" +/EC_number=\"(.*)\"","\\1","g")}'  test.gbk | sort | uniq); do echo $(curl -s http://rest.kegg.jp/link/ko/ec:$i | grep -Po '(?<=ko:).*' | sed -e "s/K/ko:K/g") | xargs -n 1 | xargs -I {} curl -s http://rest.kegg.jp/find/ko/{} | awk -v k=$i '{print "ec:"k"\t"$0}' ; done
ec:1.10.2.2	ko:K00411	UQCRFS1, RIP1, petA; ubiquinol-cytochrome c reductase iron-sulfur subunit [EC:1.10.2.2]
ec:1.1.1.1	ko:K00001	E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]
ec:1.1.1.1	ko:K00121	frmA, ADH5, adhC; S-(hydroxymethyl)glutathione dehydrogenase / alcohol dehydrogenase [EC:1.1.1.284 1.1.1.1]
ec:1.1.1.1	ko:K04072	adhE; acetaldehyde dehydrogenase / alcohol dehydrogenase [EC:1.2.1.10 1.1.1.1]
ec:1.1.1.1	ko:K11440	gbsB; choline dehydrogenase [EC:1.1.1.1]
ec:1.1.1.1	ko:K13951	ADH1_7; alcohol dehydrogenase 1/7 [EC:1.1.1.1]
ec:1.1.1.1	ko:K13952	ADH6; alcohol dehydrogenase 6 [EC:1.1.1.1]
ec:1.1.1.1	ko:K13953	adhP; alcohol dehydrogenase, propanol-preferring [EC:1.1.1.1]
ec:1.1.1.1	ko:K13954	yiaY; alcohol dehydrogenase [EC:1.1.1.1]
ec:1.1.1.1	ko:K13980	ADH4; alcohol dehydrogenase 4 [EC:1.1.1.1]
ec:1.1.1.100	ko:K00059	fabG; 3-oxoacyl-[acyl-carrier protein] reductase [EC:1.1.1.100]
ec:1.1.1.100	ko:K11610	mabA; beta-ketoacyl ACP reductase [EC:1.1.1.100]
ec:1.1.1.108	ko:K17735	lcdH, cdhA; carnitine 3-dehydrogenase [EC:1.1.1.108]
ec:1.11.1.15	ko:K03386	E1.11.1.15, PRDX, ahpC; peroxiredoxin (alkyl hydroperoxide reductase subunit C) [EC:1.11.1.15]
ec:1.11.1.15	ko:K03564	BCP, PRXQ, DOT5; peroxiredoxin Q/BCP [EC:1.11.1.15]
ec:1.11.1.15	ko:K11065	tpx; thiol peroxidase, atypical 2-Cys peroxiredoxin [EC:1.11.1.15]
ec:1.11.1.15	ko:K11185	TRYP; cytosolic tryparedoxin peroxidase, trypanosomatid typical 2-Cys peroxiredoxin [EC:1.11.1.15]
ec:1.11.1.15	ko:K11186	MTRYP; mitochondrial tryparedoxin peroxidase, trypanosomatid typical 2-Cys peroxiredoxin [EC:1.11.1.15]
ec:1.11.1.15	ko:K11187	PRDX5; peroxiredoxin 5, atypical 2-Cys peroxiredoxin [EC:1.11.1.15]
ec:1.11.1.15	ko:K11188	PRDX6; peroxiredoxin 6, 1-Cys peroxiredoxin [EC:1.11.1.7 1.11.1.15 3.1.1.-]
ec:1.11.1.15	ko:K13279	PRDX1; peroxiredoxin 1 [EC:1.11.1.15]
ec:1.11.1.15	ko:K14171	AHP1; alkyl hydroperoxide reductase 1 [EC:1.11.1.15]
ec:1.1.1.132	ko:K00066	algD; GDP-mannose 6-dehydrogenase [EC:1.1.1.132]
ec:1.1.1.133	ko:K00067	rfbD, rmlD; dTDP-4-dehydrorhamnose reductase [EC:1.1.1.133]
ec:1.11.1.5	ko:K00428	E1.11.1.5; cytochrome c peroxidase [EC:1.11.1.5]

Similarly, for the extracted enzymes, we can also list all the known reactions these enzymes are a part of. Each reaction is identified by the R number and is linked to ortholog groups of enzymes enabling integrated analysis of genomic and chemical information.

$ for i in $(awk '/EC_number/{print gensub(" +/EC_number=\"(.*)\"","\\1","g")}'  test.gbk | sort | uniq); do echo $(curl -s http://rest.kegg.jp/link/reaction/ec:$i | grep -Po '(?<=rn:).*') | xargs -n 1 | xargs -I {} curl -s http://rest.kegg.jp/find/reaction/rn:{} | awk -v k=$i '{print "ec:"k"\t"$0}' ; done
ec:1.10.2.2	rn:R02161	Ubiquinol:ferricytochrome-c oxidoreductase; Ubiquinol + 2 Ferricytochrome c <=> Ubiquinone + 2 Ferrocytochrome c + 2 H+
ec:1.1.1.1	rn:R00228	acetaldehyde:NAD+ oxidoreductase (CoA-acetylating); Acetaldehyde + CoA + NAD+ <=> Acetyl-CoA + NADH + H+
ec:1.1.1.1	rn:R00623	primary_alcohol:NAD+ oxidoreductase; Primary alcohol + NAD+ <=> Aldehyde + NADH + H+
ec:1.1.1.1	rn:R00624	Secondary_alcohol:NAD+ oxidoreductase; Secondary alcohol + NAD+ <=> Ketone + NADH + H+
ec:1.1.1.1	rn:R00754	ethanol:NAD+ oxidoreductase; Ethanol + NAD+ <=> Acetaldehyde + NADH + H+
ec:1.1.1.1	rn:R01172	butanal:NAD+ oxidoreductase (CoA-acylating); Butanal + CoA + NAD+ <=> Butanoyl-CoA + NADH + H+
ec:1.1.1.1	rn:R02124	retinol:NAD+ oxidoreductase; Retinol + NAD+ <=> Retinal + NADH + H+
ec:1.1.1.1	rn:R02878	1-Octanol:NAD+ oxidoreductase; 1-Octanol + NAD+ <=> 1-Octanal + NADH + H+
ec:1.1.1.1	rn:R04805	3alpha,7alpha,26-Trihydroxy-5beta-cholestane + NAD+ <=> 3alpha,7alpha-Dihydroxy-5beta-cholestan-26-al + NADH + H+
ec:1.1.1.1	rn:R04880	3,4-dihydroxyphenylethyleneglycol:NAD+ oxidoreductase; 3,4-Dihydroxyphenylethyleneglycol + NAD+ <=> 3,4-Dihydroxymandelaldehyde + NADH + H+
ec:1.1.1.1	rn:R05233	trans-3-Chloro-2-propene-1-ol:NAD+ oxidoreductase; trans-3-Chloro-2-propene-1-ol + NAD+ <=> trans-3-Chloroallyl aldehyde + NADH + H+
ec:1.1.1.1	rn:R05234	cis-3-chloro-2-propene-1-ol:NAD+ oxidoreductase; cis-3-Chloro-2-propene-1-ol + NAD+ <=> cis-3-Chloroallyl aldehyde + NADH + H+
ec:1.1.1.1	rn:R06917	1-hydroxymethylnaphthalene:NAD+ oxidoreductase; 1-Hydroxymethylnaphthalene + NAD+ <=> 1-Naphthaldehyde + NADH + H+
ec:1.1.1.1	rn:R06927	(2-naphthyl)methanol:NAD+ oxidoreductase; (2-Naphthyl)methanol + NAD+ <=> 2-Naphthaldehyde + NADH + H+
ec:1.1.1.1	rn:R06983	S-(hydroxymethyl)glutathione dehydrogenase; S-(Hydroxymethyl)glutathione + NAD+ <=> S-Formylglutathione + NADH + H+
ec:1.1.1.1	rn:R07105	trichloroethanol:NAD+ oxidoreductase; Chloral hydrate + NADH + H+ <=> Trichloroethanol + NAD+ + H2O
ec:1.1.1.1	rn:R07326	alcohol:NAD+ oxidoreductase; Alcohol + NAD+ <=> Aldehyde + NADH + H+
ec:1.1.1.1	rn:R07327	alcohol:NAD+ oxidoreductase; Alcohol + NAD+ <=> Ketone + NADH + H+
ec:1.1.1.1	rn:R08281	alcophosphamide:NAD+ oxidoreductase; Aldophosphamide + NADH + H+ <=> Alcophosphamide + NAD+
ec:1.1.1.1	rn:R08306	2-phenyl-1,3-propanediol monocarbamate:NAD+ oxidoreductase; 2-Phenyl-1,3-propanediol monocarbamate + NAD+ <=> 3-Carbamoyl-2-phenylpropionaldehyde + NADH + H+
ec:1.1.1.1	rn:R08310	4-hydroxy-5-phenyltetrahydro-1,3-oxazin-2-one:NAD+ oxidoreductase; 4-Hydroxy-5-phenyltetrahydro-1,3-oxazin-2-one + NAD+ <=> 5-Phenyl-1,3-oxazinane-2,4-dione + NADH + H+
ec:1.1.1.1	rn:R08557	Choline + NAD+ <=> Betaine aldehyde + NADH + H+

We may also be interested in knowing which other organisms contain the same enzymes:

$ for i in $(awk '/EC_number/{print gensub(" +/EC_number=\"(.*)\"","\\1","g")}'  test.gbk | sort | uniq); do echo $(curl -s http://rest.kegg.jp/link/genome/ec:$i | grep -Po '(?<=genome:).*') | xargs -n 1 | xargs -I {} curl -s http://rest.kegg.jp/find/genome/genome:{} | awk -v k=$i '{print "ec:"k"\t"$0}' ; done
ec:1.10.2.2	genome:T01003	rno, RAT, 10116; Rattus norvegicus (Norway rat)
ec:1.10.2.2	genome:T01008	bta, BOVIN, 9913; Bos taurus (cow)
ec:1.1.1.1	genome:T00030	dme, DROME, 7227; Drosophila melanogaster (fruit fly)
ec:1.1.1.1	genome:T01001	hsa, HUMAN, 9606; Homo sapiens (human)
ec:1.1.1.1	genome:T01003	rno, RAT, 10116; Rattus norvegicus (Norway rat)
ec:1.1.1.1	genome:T01058	ecb, HORSE, 9796; Equus caballus (horse)
ec:1.1.1.100	genome:T00007	eco, ECOLI, 511145; Escherichia coli K-12 MG1655
ec:1.1.1.108	genome:T00035	pae, PSEAE, 208964; Pseudomonas aeruginosa PAO1
ec:1.11.1.15	genome:T01001	hsa, HUMAN, 9606; Homo sapiens (human)
ec:1.11.1.15	genome:T01003	rno, RAT, 10116; Rattus norvegicus (Norway rat)
ec:1.11.1.5	genome:T00005	sce, YEAST, 559292; Saccharomyces cerevisiae S288c
ec:1.1.1.157	genome:T00564	ckl, CLOK5, 431943; Clostridium kluyveri DSM 555
ec:1.11.1.6	genome:T00115	lpl, LACPL, 220668; Lactobacillus plantarum WCFS1
ec:1.11.1.6	genome:T01001	hsa, HUMAN, 9606; Homo sapiens (human)
ec:1.1.1.169	genome:T00005	sce, YEAST, 559292; Saccharomyces cerevisiae S288c
ec:1.1.1.169	genome:T00007	eco, ECOLI, 511145; Escherichia coli K-12 MG1655
ec:1.11.1.7	genome:T01008	bta, BOVIN, 9913; Bos taurus (cow)
ec:1.11.1.9	genome:T01008	bta, BOVIN, 9913; Bos taurus (cow)
ec:1.1.1.205	genome:T00566	kpn, KLEP7, 272620; Klebsiella pneumoniae subsp. pneumoniae MGH 78578

Best Wishes,
Umer

view raw biostars-98923.md hosted with ❤ by GitHub