Verify if all the components of a pathway are present in a genomes
4
2
Entering edit mode
10.2 years ago
dago ★ 2.8k

I would like to verify if all the protein in specific pathways (e.g. secretion systems, two component regulatory system) are present in a genome.

Is there a way to do this automatically iterating the search for many genomes?

genome annotation • 9.1k views
ADD COMMENT
10
Entering edit mode
10.2 years ago

You can use MinPath to see if all the enzymes are sufficient for a given Pathway to exist (the other way around). This is what I will do:

If I start with contigs or a genome sequence (can be a file with multiple genomes in it), I will use a tool such as PROKKA to generate a genbank file of my annotation as well as all the extracted CDS regions as a FASTA file with amino acid sequences.

Say I start with the following file:

$ head Sample_1.fasta
>PROT1
AAPAAAAPAAAPAAQAAAPAAAGGDDAILSPAARKLAEEAGIDPNSIAGTGKGGRVTKEDVVAAVEAKKNAPAALAKPAAPAAEAPIFAAGDRVEKRVPMTRLRAKVAERLVEAQSAMAMLTTFNEVNMKPIMDLRSKYKDLFEKKHNGVRLGFMSFFVKAATEALKRFPGVNASIDGNDIVYHGYQDIGVAVSSDRGLVVPVLRNAEFMSLAEIEGGIANFGKKAKEGKLTIEDMTGGTFTISNGGVFGSLLSTPIVNPPQTAILGMHKIQERPMAVNGQVVILPMMYLALSYDHRLIDGKEAVSFLVAIKDLLEDPARLLLDV
>PROT2
AAPAAAGGDDAILSPAARKLAEEAGIDPNSIAGTGKGGRVTKEDVVAAVEAKKNAPAAPAKPAAPAAEAPIFAAGDRVEKRVPMTRLRAKVAERLVEAQSAMAMLTTFNEVNMKPIMDLRSKYKDLFEKKHNGVRLGFMSFFVKAATEALKRFPGVNASIDGNDIVYHGYQDIGVAVSSDRGLVVPVLRNAEFMSLAEIEGGIANFGKKAKEGKLTIEDMTGGTFTISNGGVFGSLLSTPIVNPPQTAILGMHKIQERPMAVNGQVVILPMMYLALSYDHRLIDGKEAVSFLVAIKDLLEDPARLLLDV
>PROT3
AKSKFERNKPHVNIGTIGHVDHGKTSLTAAITKYFGEFKAYDQIDAAPEEKARGITISTAHVEYETPNRHYAHVDCPGHADYVKNMITGAAQMDGAILVVSAADGPMPQTREHILLARQVGVPAIVVFLNKVDQVDDAELLELVELEVRELLSSYEFPGDDIPIVKGSALAALEDSDKKIGEDAIRELMAAVDAYIPTPERPIDQPFLMPIEDVFSISGRGTVVTGRVERGIVKVGEEIEIVGIRPTTKTTCTGVEMFRKLLDQGQAGDNIGALLRGVDRNGVERGQILCKPGSVKPHRKFKAEAYILTKEEGGRHTPFFTNYRPQFYFRTTDVTGIVTLPEGTEMVMPGDNVTVDVELIVPIAMEEKLRFAIREGGRTVGAGIVASIVE
>PROT4
APAAAAPAAAPAAQAAAPAAAGGDDAILSPAARKLAEEAGIDPNSIAGTGKGGRVTKEDVVAAVEAKKNAPAAPAKPAAPAAEAPIFAAGDRVEKRVPMTRLRAKVAERLVEAQSAMAMLTTFNEVNMKPIMDLRSKYKDLFEKKHNGVRLGFMSFFVKAATEALKRFPGVNASIDGNDIVYHGYQDIGVAVSSDRGLVVPVLRNAEFMSLAEIEGGIANFGKKAKEGKLTIEDMTGGTFTISNGGVFGSLLSTPIVNPPQTAILGMHKIQERPMAVNGQVVILPMMYLALSYDHRLIDGKEAVSFLVAIKDLLEDPARLLLDV
>PROT5
ATLLAQAIIREGMKNVTAGANPMVVRKGIQDAVDAAVEALHKNSKKVTCSEDIARVATISAGDETVGKLIAEAMEKVTSDGVITVEESKTSETYSEVVEGMMFDRGYISPYMATDTEKMEAIMDDAYLLITDKKISNIQEILPLLEKIVQSGKKLVIIAEDVEGEALTTLILNKLRGTFNCAAV

Next, we use RAPsearch to search our protein sequences against the KEGG database and get a tab-delimited blast output file (identified by *.out.m8 ending). To speed up RAPsearch, you can specify the number of cores to run on using -z switch.

$ rapsearch -d /PATH_TO_KEGG_DB/kegg_db -z 2 -q Sample_1.fasta -o Sample_1.out
$ head *.out.m8
# RAPSearch
# Job submitted: Thu Jul  3 17:57:51 2014
# Query : Sample_1.fasta
# Subject : /PATH_TO_KEGG_DB/kegg_db
# Fields: Query	Subject	identity	aln-len	mismatch	gap-openings	q.start	q.end	s.start	s.end	log(e-value)	bit-score
PROT1	psa:PST_1876	89.2308	325	35	0	1	325	84	408	-156.9	557.37
PROT1	pmy:Pmen_2502	84.7095	327	48	1	1	325	84	410	-149.71	533.49
PROT1	psb:Psyr_2010	78.7692	325	67	1	1	325	89	411	-137.65	493.43
PROT1	hch:HCH_04744	73.2143	336	74	4	2	325	80	411	-130.46	469.54
PROT1	sde:Sde_2105	71.6049	324	89	2	5	325	80	403	-124.55	449.9

We can then identify enzymes in our dataset. For this purpose, we use a REST-style KEGG API to link KEGG entries (2nd column in the output file) to EC numbers:

$ awk -F"\t" '!/#/{print $1","$2}' Sample_1.out.m8 | while IFS=$"," read -r -a mA; do echo -e "${mA[0]}\t$(curl -s http://rest.kegg.jp/link/ec/${mA[1]})"; done | awk '{print $1"\t"$3}' > ec_assignments.txt
$ head ec_assignments.txt
PROT1	ec:2.3.1.61
PROT1	ec:2.3.1.61
PROT1	ec:2.3.1.61
PROT1	ec:2.3.1.61
PROT1	ec:2.3.1.61
PROT1	ec:2.3.1.61
PROT1	ec:2.3.1.61
PROT1	ec:2.3.1.61
PROT1	ec:2.3.1.61
PROT1	ec:2.3.1.61

RAPsearch reports multiple matches for a single protein sequence. We assign the most abundant EC number to each protein sequence:

$ sort ec_assignments.txt | uniq -c | grep -i "ec:" | awk '{gsub("^ec:","",$3);if(ec[$2]){if($1>=c[$2]){ec[$2]=$3;c[$2]=$1}} else {ec[$2]=$3;c[$2]=$1}} END{for (i in ec) print i"\t"ec[i]}' > Sample_1.ec
$ tail Sample_1.ec
PROT1	2.3.1.61
PROT78	4.2.1.11
PROT2	2.3.1.61
PROT136	1.1.1.86
PROT137	1.12.7.2
PROT4	2.3.1.61
PROT138	2.7.9.1
PROT139	3.2.1.23
PROT95	1.1.1.86
PROT9	3.6.3.14

We next construct parsimonious pathways using MinPath as it does not attempt to reconstruct entire pathways from a given set of protein sequences identified in a protemics/genomics experiment, but determines the minimal set of biological pathways that must exist in the biological system to explain the input protein sequences. It takes a set of reference pathways and a set of proteins that can be mapped to one or more pathways, and finds the minimum number of pathways that can explain all the protein functions. MinPath comes with a mapping file ec2path that lists the reference pathways from MetaCyc and all the EC numbers associated with them. MetaCyc serves as an encyclopedia of metabolism containing more than 2151 patways from more than 2500 different organisms.

$  head -50 ec2path
#Metacyc pathway and ec mapping file
#Pathway	EC
PWY0-1479	3.1.26.12
PWY0-1479	2.7.7.56
PWY0-1479	3.1.13.5
PWY0-1479	3.1.26.5
PWY-6146	4.2.1.11
PWY-6146	5.4.2.1
PWY-6146	1.2.7.6
PWY-6146	3.5.4.9
PWY-6146	6.4.1.1
PWY-6146	1.2.7.1
PWY-6146	2.6.1.1
PWY-6146	2.6.1.2
PWY-6146	1.1.1.261
...

To construct pathways from the generated EC file using MetaCyc pathways (using default mapping file in MinPath), run it as follows:

$ MinPath1.2.py -any Sample_1.ec -map ec2path -report Sample_1.ec.report -details Sample_1.ec.details
$ head Sample_1.ec.report
path 2 any n/a  naive 1  minpath 0  fam0  39  fam-found  2  name  PWY-6146
path 4 any n/a  naive 1  minpath 1  fam0  8  fam-found  1  name  BRANCHED-CHAIN-AA-SYN-PWY
path 11 any n/a  naive 1  minpath 1  fam0  9  fam-found  1  name  PWY-821
path 44 any n/a  naive 1  minpath 1  fam0  1  fam-found  1  name  GLUTSYNIII-PWY
path 46 any n/a  naive 1  minpath 0  fam0  6  fam-found  1  name  PWY-5505
path 49 any n/a  naive 1  minpath 1  fam0  8  fam-found  1  name  PWY-6549
path 53 any n/a  naive 1  minpath 0  fam0  13  fam-found  1  name  THREOCAT-PWY
path 55 any n/a  naive 1  minpath 0  fam0  5  fam-found  1  name  ILEUSYN-PWY
path 56 any n/a  naive 1  minpath 0  fam0  11  fam-found  1  name  PWY-3001
path 57 any n/a  naive 1  minpath 0  fam0  6  fam-found  1  name  PWY-5101

The last column reports the pathways, which you can search on MetaCyc website to get the detailed information. For example, here is the record for BRANCHED-CHAIN-AA-SYN-PWY on the website:

< image not found >

http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/Metaproteomics/metacyc.png" style="height:410px; width:600px" />

However, we are interested in KEGG pathways so that we can draw metabolic networks using iPath. We, therefore, generate a new mapping file using the REST-style KEGG API:

$ for I in $(curl -S http://rest.kegg.jp/list/pathway | awk -F"\t" '{print $1}'); do curl -S http://rest.kegg.jp/link/ec/$i; done > ec2kegg
$ awk -F"\t" '!/^$/{gsub("^path:","",$1);gsub("^ec:","",$2);print $1"\t"$2}' ec2kegg > ec2kegg_final
$ chmod +x ec2kegg_final
$ head -50 ec2kegg_final
map00010	1.1.1.1
map00010	1.1.1.2
map00010	1.1.1.27
map00010	1.1.2.7
map00010	1.1.2.8
map00010	1.2.1.12
map00010	1.2.1.3
map00010	1.2.1.5
map00010	1.2.1.59
map00010	1.2.1.9
...

Now run MinPath1.2.py with the new mapping file using -map ec2kegg_final in the command:

$ MinPath1.2.py -any Sample_1.ec -map ec2kegg_final -report Sample_1.ec.report.kegg -details Sample_1.ec.details.kegg
$ head Sample_1.ec.report.kegg
path 1 any n/a  naive 1  minpath 0  fam0  45  fam-found  3  name  map00010
path 2 any n/a  naive 1  minpath 0  fam0  24  fam-found  1  name  map00020
path 3 any n/a  naive 1  minpath 0  fam0  50  fam-found  1  name  map00030
path 4 any n/a  naive 1  minpath 0  fam0  63  fam-found  1  name  map00040
path 5 any n/a  naive 1  minpath 0  fam0  70  fam-found  2  name  map00051
path 6 any n/a  naive 1  minpath 0  fam0  46  fam-found  1  name  map00052
path 9 any n/a  naive 1  minpath 0  fam0  13  fam-found  1  name  map00062
path 10 any n/a  naive 1  minpath 0  fam0  30  fam-found  2  name  map00071
path 18 any n/a  naive 1  minpath 0  fam0  11  fam-found  2  name  map00190
path 19 any n/a  naive 1  minpath 0  fam0  3  fam-found  1  name  map00195

To get the description of the reported KEGG pathways, use the following one-liner:

$ for I in $(awk '{print $14}' Sample_1.ec.report.kegg); do curl -S http://rest.kegg.jp/find/pathway/$i; done
path:map00010	Glycolysis / Gluconeogenesis
path:map00020	Citrate cycle (TCA cycle)
path:map00030	Pentose phosphate pathway
path:map00040	Pentose and glucuronate interconversions
path:map00051	Fructose and mannose metabolism
path:map00052	Galactose metabolism
path:map00062	Fatty acid elongation
path:map00071	Fatty acid degradation
path:map00190	Oxidative phosphorylation
path:map00195	Photosynthesis
path:map00230	Purine metabolism
path:map00240	Pyrimidine metabolism
path:map00250	Alanine, aspartate and glutamate metabolism
path:map00280	Valine, leucine and isoleucine degradation
path:map00281	Geraniol degradation
path:map00290	Valine, leucine and isoleucine biosynthesis
path:map00310	Lysine degradation
path:map00330	Arginine and proline metabolism
path:map00360	Phenylalanine metabolism
path:map00362	Benzoate degradation
path:map00380	Tryptophan metabolism
path:map00410	beta-Alanine metabolism
path:map00511	Other glycan degradation
path:map00531	Glycosaminoglycan degradation
path:map00592	alpha-Linolenic acid metabolism
path:map00600	Sphingolipid metabolism
path:map00604	Glycosphingolipid biosynthesis - ganglio series
path:map00620	Pyruvate metabolism
path:map01120	Microbial metabolism in diverse environments
...

We want to use interactive Pathways Explorer (iPath), which is a web-based tool for the visualization, analysis and customization of the various pathways maps. The metabolic pathways are constructed using 146 KEGG pathways, and give an overview of the complete metabolism in biological systems. We first extract the pathways in the format to be used with iPath and give each of them red color:

$ awk '{gsub("^map","",$14);print $14" #ff0000"}' Sample_1.ec.report.kegg > Sample_1.ec.report.kegg.ipath

Now go to http://pathways.embl.de/index.html and click on iPath v2 interface to load the applet:

< image not found >

http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/Metaproteomics/ipath.png" style="height:313px; width:600px" />

Next click on the "Customize" button to load the "Customization and data mapping" tab, and upload Sample_1.ec.report.kegg.ipath by clicking on "Select file" button. This will populate the "Element selection" text area. You can manually assign different colors to your pathways here. Finally, click on "Submit data and customize maps" button to generate your metabolic network.

< image not found >

http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/Metaproteomics/dataupload.png" style="height:424px; width:600px" />

Once generated, you can save the metabolic network as a PDF document to your local computer. The connections in red are the ones present in your sample:

< image not found >

http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/Metaproteomics/Sample_1.png" style="height:366px; width:600px" />

Best Wishes,
Umer

Cleaned up the post and moved to a Gist by Ram on 22-Apr-2022.

ADD COMMENT
0
Entering edit mode

Wow this looks like a really detailed work flow. I will give it a try but I am a bit dumb with the KEGG API. Any tutorial I could use to learn it?

ADD REPLY
1
Entering edit mode

Hi dago,

I figured it out by hit and trial method. Bioconductor's KEGGREST could be a good starting point to explore KEGG REST service further. Please also have a look at the HUManN workflow on how they incorporated MinPath. Here are a few more one-liners that you can use on a GENBANK file to understand the KEGG service better:

The API uses KEGG ENZYME database, which is an implementation of the Enzyme Nomenclature (EC Numbers) on the ExplorEnz database, and is maintained in the KEGG LIGAND relational database with additional annotation of reaction hierarchy, organism information, and sequence data links.

To use these one-liners on your GENBANK files, replace test.gbk with the name of the file you are using. First step is to extract a tab-separated list of only those contigs (once you have annotated them through PROKKA) which have enzymes in them. All other contigs are ignored

$ awk '/EC_number/{print gensub(" +/EC_number=\"(.*)\"","ec:\\1","g")}/^LOCUS/{print ";"$2}'  test.gbk | tr '\n' '\t' | tr ';' '\n' | awk 'NF>1{print $0}'
NODE_2925_length_923_cov_1.698808   ec:4.2.1.17 
NODE_4520_length_55408_cov_7.328635 ec:1.11.1.6 ec:1.2.1.22 ec:1.1.99.1 ec:2.5.1.18 
NODE_6043_length_823_cov_1.500607   ec:6.2.1.25 
NODE_918_length_270_cov_10.744445   ec:1.5.3.1  
NODE_312_length_2459_cov_13.763318  ec:4.3.1.17 
NODE_282_length_4935_cov_13.040932  ec:2.7.6.5  ec:3.1.7.2  ec:2.7.7.6  ec:2.7.4.8  
NODE_3153_length_49002_cov_8.110995 ec:1.4.3.19 ec:4.1.3.39 ec:5.1.1.8  ec:3.5.4.22 ec:1.2.1.26 ec:1.1.1.46 ec:1.1.1.1  ec:4.1.1.28 
NODE_241_length_4648_cov_13.109509  ec:4.2.1.20 
NODE_533_length_2973_cov_13.546249  ec:4.2.1.20
...

The following one-liner extracts the list of all enzymes found in the GENBANK file and uses rest-style KEGG API to generate names from EC numbers:

$ for I in $(awk '/EC_number/{print gensub(" +/EC_number=\"(.*)\"","\\1","g")}'  test.gbk | sort | uniq); do echo $(curl -s http://rest.kegg.jp/find/enzyme/$i) | awk '{$1=$1"\t"}1'; done
ec:1.1.1.108     carnitine 3-dehydrogenase
ec:1.11.1.15     peroxiredoxin; thioredoxin peroxidase; tryparedoxin peroxidase; alkyl hydroperoxide reductase C22; AhpC; TrxPx; TXNPx; Prx; PRDX
ec:1.1.1.158     Transferred to 1.3.1.98
ec:1.11.1.6  catalase; equilase; caperase; optidase; catalase-peroxidase; CAT
ec:1.1.1.169     2-dehydropantoate 2-reductase; 2-oxopantoate reductase; 2-ketopantoate reductase; 2-ketopantoic acid reductase; ketopantoate reductase; ketopantoic acid reductase
ec:1.11.1.7  peroxidase; lactoperoxidase; guaiacol peroxidase; plant peroxidase; Japanese radish peroxidase; horseradish peroxidase (HRP); soybean peroxidase (SBP); extensin peroxidase; heme peroxidase; oxyperoxidase; protoheme peroxidase; pyrocatechol peroxidase; scopoletin peroxidase; Coprinus cinereus peroxidase; Arthromyces ramosus peroxidase
...

For the extracted enzymes, we can list all the KEGG Ortholog (KO) groups each enzyme is part of, along with their detailed description. The KO system is the basis for representation for all proteins and functional RNAs that correspond to KEGG pathway nodes, BRITE hierarchy nodes, and KEGG module nodes:

$ for I in $(awk '/EC_number/{print gensub(" +/EC_number=\"(.*)\"","\\1","g")}'  test.gbk | sort | uniq); do echo $(curl -s http://rest.kegg.jp/link/ko/ec:$i | grep -Po '(?<=ko:).*' | sed -e "s/K/ko:K/g") | xargs -n 1 | xargs -I {} curl -s http://rest.kegg.jp/find/ko/{} | awk -v k=$i '{print "ec:"k"\t"$0}' ; done
ec:1.10.2.2 ko:K00411   UQCRFS1, RIP1, petA; ubiquinol-cytochrome c reductase iron-sulfur subunit [EC:1.10.2.2]
ec:1.1.1.1  ko:K00001   E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]
ec:1.1.1.1  ko:K00121   frmA, ADH5, adhC; S-(hydroxymethyl)glutathione dehydrogenase / alcohol dehydrogenase [EC:1.1.1.284 1.1.1.1]
ec:1.1.1.1  ko:K04072   adhE; acetaldehyde dehydrogenase / alcohol dehydrogenase [EC:1.2.1.10 1.1.1.1]
ec:1.1.1.1  ko:K11440   gbsB; choline dehydrogenase [EC:1.1.1.1]
ec:1.1.1.1  ko:K13951   ADH1_7; alcohol dehydrogenase 1/7 [EC:1.1.1.1]
ec:1.1.1.1  ko:K13952   ADH6; alcohol dehydrogenase 6 [EC:1.1.1.1]
ec:1.1.1.1  ko:K13953   adhP; alcohol dehydrogenase, propanol-preferring [EC:1.1.1.1]
ec:1.1.1.1  ko:K13954   yiaY; alcohol dehydrogenase [EC:1.1.1.1]
ec:1.1.1.1  ko:K13980   ADH4; alcohol dehydrogenase 4 [EC:1.1.1.1]
ec:1.1.1.100    ko:K00059   fabG; 3-oxoacyl-[acyl-carrier protein] reductase [EC:1.1.1.100]
ec:1.1.1.100    ko:K11610   mabA; beta-ketoacyl ACP reductase [EC:1.1.1.100]
ec:1.1.1.108    ko:K17735   lcdH, cdhA; carnitine 3-dehydrogenase [EC:1.1.1.108]
...

Similarly, for the extracted enzymes, we can also list all the known reactions these enzymes are a part of. Each reaction is identified by the R number and is linked to ortholog groups of enzymes enabling integrated analysis of genomic and chemical information.

$ for I in $(awk '/EC_number/{print gensub(" +/EC_number=\"(.*)\"","\\1","g")}'  test.gbk | sort | uniq); do echo $(curl -s http://rest.kegg.jp/link/reaction/ec:$i | grep -Po '(?<=rn:).*') | xargs -n 1 | xargs -I {} curl -s http://rest.kegg.jp/find/reaction/rn:{} | awk -v k=$i '{print "ec:"k"\t"$0}' ; done
ec:1.10.2.2 rn:R02161   Ubiquinol:ferricytochrome-c oxidoreductase; Ubiquinol + 2 Ferricytochrome c <=> Ubiquinone + 2 Ferrocytochrome c + 2 H+
ec:1.1.1.1  rn:R00228   acetaldehyde:NAD+ oxidoreductase (CoA-acetylating); Acetaldehyde + CoA + NAD+ <=> Acetyl-CoA + NADH + H+
ec:1.1.1.1  rn:R00623   primary_alcohol:NAD+ oxidoreductase; Primary alcohol + NAD+ <=> Aldehyde + NADH + H+
ec:1.1.1.1  rn:R00624   Secondary_alcohol:NAD+ oxidoreductase; Secondary alcohol + NAD+ <=> Ketone + NADH + H+
ec:1.1.1.1  rn:R00754   ethanol:NAD+ oxidoreductase; Ethanol + NAD+ <=> Acetaldehyde + NADH + H+
ec:1.1.1.1  rn:R01172   butanal:NAD+ oxidoreductase (CoA-acylating); Butanal + CoA + NAD+ <=> Butanoyl-CoA + NADH + H+
ec:1.1.1.1  rn:R02124   retinol:NAD+ oxidoreductase; Retinol + NAD+ <=> Retinal + NADH + H+
ec:1.1.1.1  rn:R02878   1-Octanol:NAD+ oxidoreductase; 1-Octanol + NAD+ <=> 1-Octanal + NADH + H+
...

We may also be interested in knowing which other organisms contain the same enzymes:

$ for I in $(awk '/EC_number/{print gensub(" +/EC_number=\"(.*)\"","\\1","g")}'  test.gbk | sort | uniq); do echo $(curl -s http://rest.kegg.jp/link/genome/ec:$i | grep -Po '(?<=genome:).*') | xargs -n 1 | xargs -I {} curl -s http://rest.kegg.jp/find/genome/genome:{} | awk -v k=$i '{print "ec:"k"\t"$0}' ; done
ec:1.10.2.2 genome:T01003   rno, RAT, 10116; Rattus norvegicus (Norway rat)
ec:1.10.2.2 genome:T01008   bta, BOVIN, 9913; Bos taurus (cow)
ec:1.1.1.1  genome:T00030   dme, DROME, 7227; Drosophila melanogaster (fruit fly)
ec:1.1.1.1  genome:T01001   hsa, HUMAN, 9606; Homo sapiens (human)
ec:1.1.1.1  genome:T01003   rno, RAT, 10116; Rattus norvegicus (Norway rat)
ec:1.1.1.1  genome:T01058   ecb, HORSE, 9796; Equus caballus (horse)
ec:1.1.1.100    genome:T00007   eco, ECOLI, 511145; Escherichia coli K-12 MG1655
ec:1.1.1.108    genome:T00035   pae, PSEAE, 208964; Pseudomonas aeruginosa PAO1
ec:1.11.1.15    genome:T01001   hsa, HUMAN, 9606; Homo sapiens (human)
ec:1.11.1.15    genome:T01003   rno, RAT, 10116; Rattus norvegicus (Norway rat)
ec:1.11.1.5 genome:T00005   sce, YEAST, 559292; Saccharomyces cerevisiae S288c
ec:1.1.1.157    genome:T00564   ckl, CLOK5, 431943; Clostridium kluyveri DSM 555
ec:1.11.1.6 genome:T00115   lpl, LACPL, 220668; Lactobacillus plantarum WCFS1

Best Wishes,
Umer

ADD REPLY
1
Entering edit mode
10.2 years ago
5heikki 11k

A combination of KEGG's API, blastp and scripting should do it. If you're doing more small scale stuff, it might be feasible to just use http://www.kegg.jp/blastkoala/ and then the mapping tools provided at the site..

ADD COMMENT
0
Entering edit mode

Thanks for the answer. I tried that as well. But I was just wondering how sensitive this is. In blast KOALA the search is done using as reference the information for a Genus, and the information available for the genus I am working with are quite limited. Therefore, will I miss some protein known in other taxon to be involved in a specific process but not described in my genus?

ADD REPLY
0
Entering edit mode

I don't think so:

The database files are generated from KEGG GENES as a collection of representative genomes by removing similar organisms at the species, genus or family level. When multiple members are present in each species/genus/family group, the first genome is taken as a representative genome. When the other members in the group contain different K numbers that are not present in the representative genome, those genes are added as if they are present in additional chromosomes or plasmids.

Stated right there in the blast koala site..

ADD REPLY
0
Entering edit mode

Ok, got it, but for my Genus there are only two genomes available, will it change?

ADD REPLY
0
Entering edit mode
10.2 years ago
Ram 45k

I'd take in a list of proteins as query and BLASTP against organism-filtered NR, verify identity and similarity in top 3 hits per query to ensure they fall under a threshold.

ADD COMMENT
0
Entering edit mode
10.2 years ago
Asaf 10k

You might want to check the methods of this paper: Long-term phenotypic evolution of bacteria : Nature : Nature Publishing Group

They used Flux Balance analysis to determine the substrates a bacterial strain can grow on.

ADD COMMENT
0
Entering edit mode

This is really cool, thanks for sharing. I am afraid this is to complex for my capacity..:P

ADD REPLY

Login before adding your answer.

Traffic: 2103 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6