Question

Verify if all the components of a pathway are present in a genomes

2

Entering edit mode

10.2 years ago

dago ★ 2.8k

I would like to verify if all the protein in specific pathways (e.g. secretion systems, two component regulatory system) are present in a genome.

Is there a way to do this automatically iterating the search for many genomes?

genome annotation • 9.2k views

ADD COMMENT • link updated 3.0 years ago by Ram 45k • written 10.2 years ago by dago ★ 2.8k

1

Entering edit mode

10.2 years ago

5heikki 11k

A combination of KEGG's API, blastp and scripting should do it. If you're doing more small scale stuff, it might be feasible to just use http://www.kegg.jp/blastkoala/ and then the mapping tools provided at the site..

ADD COMMENT • link updated 3.0 years ago by Ram 45k • written 10.2 years ago by 5heikki 11k

0

Entering edit mode

Thanks for the answer. I tried that as well. But I was just wondering how sensitive this is. In blast KOALA the search is done using as reference the information for a Genus, and the information available for the genus I am working with are quite limited. Therefore, will I miss some protein known in other taxon to be involved in a specific process but not described in my genus?

ADD REPLY • link updated 3.0 years ago by Ram 45k • written 10.2 years ago by dago ★ 2.8k

0

Entering edit mode

I don't think so:

The database files are generated from KEGG GENES as a collection of representative genomes by removing similar organisms at the species, genus or family level. When multiple members are present in each species/genus/family group, the first genome is taken as a representative genome. When the other members in the group contain different K numbers that are not present in the representative genome, those genes are added as if they are present in additional chromosomes or plasmids.

Stated right there in the blast koala site..

ADD REPLY • link updated 3.0 years ago by Ram 45k • written 10.2 years ago by 5heikki 11k

0

Entering edit mode

Ok, got it, but for my Genus there are only two genomes available, will it change?

ADD REPLY • link updated 3.0 years ago by Ram 45k • written 10.2 years ago by dago ★ 2.8k

0

Entering edit mode

10.2 years ago

Ram 45k

I'd take in a list of proteins as query and BLASTP against organism-filtered NR, verify identity and similarity in top 3 hits per query to ensure they fall under a threshold.

ADD COMMENT • link 3.0 years ago by Ram 45k

0

Entering edit mode

10.2 years ago

Asaf 10k

You might want to check the methods of this paper: Long-term phenotypic evolution of bacteria : Nature : Nature Publishing Group

They used Flux Balance analysis to determine the substrates a bacterial strain can grow on.

ADD COMMENT • link updated 3.0 years ago by Ram 45k • written 10.2 years ago by Asaf 10k

0

Entering edit mode

This is really cool, thanks for sharing. I am afraid this is to complex for my capacity..:P

ADD REPLY • link 10.2 years ago by dago ★ 2.8k

Ram · Accepted Answer · 2015-02-16

You can use MinPath to see if all the enzymes are sufficient for a given Pathway to exist (the other way around). This is what I will do:

If I start with contigs or a genome sequence (can be a file with multiple genomes in it), I will use a tool such as PROKKA to generate a genbank file of my annotation as well as all the extracted CDS regions as a FASTA file with amino acid sequences.

Say I start with the following file:

$ head Sample_1.fasta
>PROT1
AAPAAAAPAAAPAAQAAAPAAAGGDDAILSPAARKLAEEAGIDPNSIAGTGKGGRVTKEDVVAAVEAKKNAPAALAKPAAPAAEAPIFAAGDRVEKRVPMTRLRAKVAERLVEAQSAMAMLTTFNEVNMKPIMDLRSKYKDLFEKKHNGVRLGFMSFFVKAATEALKRFPGVNASIDGNDIVYHGYQDIGVAVSSDRGLVVPVLRNAEFMSLAEIEGGIANFGKKAKEGKLTIEDMTGGTFTISNGGVFGSLLSTPIVNPPQTAILGMHKIQERPMAVNGQVVILPMMYLALSYDHRLIDGKEAVSFLVAIKDLLEDPARLLLDV
>PROT2
AAPAAAGGDDAILSPAARKLAEEAGIDPNSIAGTGKGGRVTKEDVVAAVEAKKNAPAAPAKPAAPAAEAPIFAAGDRVEKRVPMTRLRAKVAERLVEAQSAMAMLTTFNEVNMKPIMDLRSKYKDLFEKKHNGVRLGFMSFFVKAATEALKRFPGVNASIDGNDIVYHGYQDIGVAVSSDRGLVVPVLRNAEFMSLAEIEGGIANFGKKAKEGKLTIEDMTGGTFTISNGGVFGSLLSTPIVNPPQTAILGMHKIQERPMAVNGQVVILPMMYLALSYDHRLIDGKEAVSFLVAIKDLLEDPARLLLDV
>PROT3
AKSKFERNKPHVNIGTIGHVDHGKTSLTAAITKYFGEFKAYDQIDAAPEEKARGITISTAHVEYETPNRHYAHVDCPGHADYVKNMITGAAQMDGAILVVSAADGPMPQTREHILLARQVGVPAIVVFLNKVDQVDDAELLELVELEVRELLSSYEFPGDDIPIVKGSALAALEDSDKKIGEDAIRELMAAVDAYIPTPERPIDQPFLMPIEDVFSISGRGTVVTGRVERGIVKVGEEIEIVGIRPTTKTTCTGVEMFRKLLDQGQAGDNIGALLRGVDRNGVERGQILCKPGSVKPHRKFKAEAYILTKEEGGRHTPFFTNYRPQFYFRTTDVTGIVTLPEGTEMVMPGDNVTVDVELIVPIAMEEKLRFAIREGGRTVGAGIVASIVE
>PROT4
APAAAAPAAAPAAQAAAPAAAGGDDAILSPAARKLAEEAGIDPNSIAGTGKGGRVTKEDVVAAVEAKKNAPAAPAKPAAPAAEAPIFAAGDRVEKRVPMTRLRAKVAERLVEAQSAMAMLTTFNEVNMKPIMDLRSKYKDLFEKKHNGVRLGFMSFFVKAATEALKRFPGVNASIDGNDIVYHGYQDIGVAVSSDRGLVVPVLRNAEFMSLAEIEGGIANFGKKAKEGKLTIEDMTGGTFTISNGGVFGSLLSTPIVNPPQTAILGMHKIQERPMAVNGQVVILPMMYLALSYDHRLIDGKEAVSFLVAIKDLLEDPARLLLDV
>PROT5
ATLLAQAIIREGMKNVTAGANPMVVRKGIQDAVDAAVEALHKNSKKVTCSEDIARVATISAGDETVGKLIAEAMEKVTSDGVITVEESKTSETYSEVVEGMMFDRGYISPYMATDTEKMEAIMDDAYLLITDKKISNIQEILPLLEKIVQSGKKLVIIAEDVEGEALTTLILNKLRGTFNCAAV

Next, we use RAPsearch to search our protein sequences against the KEGG database and get a tab-delimited blast output file (identified by *.out.m8 ending). To speed up RAPsearch, you can specify the number of cores to run on using -z switch.

$ rapsearch -d /PATH_TO_KEGG_DB/kegg_db -z 2 -q Sample_1.fasta -o Sample_1.out
$ head *.out.m8
# RAPSearch
# Job submitted: Thu Jul  3 17:57:51 2014
# Query : Sample_1.fasta
# Subject : /PATH_TO_KEGG_DB/kegg_db
# Fields: Query	Subject	identity	aln-len	mismatch	gap-openings	q.start	q.end	s.start	s.end	log(e-value)	bit-score
PROT1	psa:PST_1876	89.2308	325	35	0	1	325	84	408	-156.9	557.37
PROT1	pmy:Pmen_2502	84.7095	327	48	1	1	325	84	410	-149.71	533.49
PROT1	psb:Psyr_2010	78.7692	325	67	1	1	325	89	411	-137.65	493.43
PROT1	hch:HCH_04744	73.2143	336	74	4	2	325	80	411	-130.46	469.54
PROT1	sde:Sde_2105	71.6049	324	89	2	5	325	80	403	-124.55	449.9

We can then identify enzymes in our dataset. For this purpose, we use a REST-style KEGG API to link KEGG entries (2nd column in the output file) to EC numbers:

$ awk -F"\t" '!/#/{print $1","$2}' Sample_1.out.m8 | while IFS=$"," read -r -a mA; do echo -e "${mA[0]}\t$(curl -s http://rest.kegg.jp/link/ec/${mA[1]})"; done | awk '{print $1"\t"$3}' > ec_assignments.txt
$ head ec_assignments.txt
PROT1	ec:2.3.1.61
PROT1	ec:2.3.1.61
PROT1	ec:2.3.1.61
PROT1	ec:2.3.1.61
PROT1	ec:2.3.1.61
PROT1	ec:2.3.1.61
PROT1	ec:2.3.1.61
PROT1	ec:2.3.1.61
PROT1	ec:2.3.1.61
PROT1	ec:2.3.1.61

RAPsearch reports multiple matches for a single protein sequence. We assign the most abundant EC number to each protein sequence:

$ sort ec_assignments.txt | uniq -c | grep -i "ec:" | awk '{gsub("^ec:","",$3);if(ec[$2]){if($1>=c[$2]){ec[$2]=$3;c[$2]=$1}} else {ec[$2]=$3;c[$2]=$1}} END{for (i in ec) print i"\t"ec[i]}' > Sample_1.ec
$ tail Sample_1.ec
PROT1	2.3.1.61
PROT78	4.2.1.11
PROT2	2.3.1.61
PROT136	1.1.1.86
PROT137	1.12.7.2
PROT4	2.3.1.61
PROT138	2.7.9.1
PROT139	3.2.1.23
PROT95	1.1.1.86
PROT9	3.6.3.14

We next construct parsimonious pathways using MinPath as it does not attempt to reconstruct entire pathways from a given set of protein sequences identified in a protemics/genomics experiment, but determines the minimal set of biological pathways that must exist in the biological system to explain the input protein sequences. It takes a set of reference pathways and a set of proteins that can be mapped to one or more pathways, and finds the minimum number of pathways that can explain all the protein functions. MinPath comes with a mapping file ec2path that lists the reference pathways from MetaCyc and all the EC numbers associated with them. MetaCyc serves as an encyclopedia of metabolism containing more than 2151 patways from more than 2500 different organisms.

$  head -50 ec2path
#Metacyc pathway and ec mapping file
#Pathway	EC
PWY0-1479	3.1.26.12
PWY0-1479	2.7.7.56
PWY0-1479	3.1.13.5
PWY0-1479	3.1.26.5
PWY-6146	4.2.1.11
PWY-6146	5.4.2.1
PWY-6146	1.2.7.6
PWY-6146	3.5.4.9
PWY-6146	6.4.1.1
PWY-6146	1.2.7.1
PWY-6146	2.6.1.1
PWY-6146	2.6.1.2
PWY-6146	1.1.1.261
...

To construct pathways from the generated EC file using MetaCyc pathways (using default mapping file in MinPath), run it as follows:

$ MinPath1.2.py -any Sample_1.ec -map ec2path -report Sample_1.ec.report -details Sample_1.ec.details
$ head Sample_1.ec.report
path 2 any n/a  naive 1  minpath 0  fam0  39  fam-found  2  name  PWY-6146
path 4 any n/a  naive 1  minpath 1  fam0  8  fam-found  1  name  BRANCHED-CHAIN-AA-SYN-PWY
path 11 any n/a  naive 1  minpath 1  fam0  9  fam-found  1  name  PWY-821
path 44 any n/a  naive 1  minpath 1  fam0  1  fam-found  1  name  GLUTSYNIII-PWY
path 46 any n/a  naive 1  minpath 0  fam0  6  fam-found  1  name  PWY-5505
path 49 any n/a  naive 1  minpath 1  fam0  8  fam-found  1  name  PWY-6549
path 53 any n/a  naive 1  minpath 0  fam0  13  fam-found  1  name  THREOCAT-PWY
path 55 any n/a  naive 1  minpath 0  fam0  5  fam-found  1  name  ILEUSYN-PWY
path 56 any n/a  naive 1  minpath 0  fam0  11  fam-found  1  name  PWY-3001
path 57 any n/a  naive 1  minpath 0  fam0  6  fam-found  1  name  PWY-5101

The last column reports the pathways, which you can search on MetaCyc website to get the detailed information. For example, here is the record for BRANCHED-CHAIN-AA-SYN-PWY on the website:

http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/Metaproteomics/metacyc.png" style="height:410px; width:600px" />

However, we are interested in KEGG pathways so that we can draw metabolic networks using iPath. We, therefore, generate a new mapping file using the REST-style KEGG API:

$ for I in $(curl -S http://rest.kegg.jp/list/pathway | awk -F"\t" '{print $1}'); do curl -S http://rest.kegg.jp/link/ec/$i; done > ec2kegg
$ awk -F"\t" '!/^$/{gsub("^path:","",$1);gsub("^ec:","",$2);print $1"\t"$2}' ec2kegg > ec2kegg_final
$ chmod +x ec2kegg_final
$ head -50 ec2kegg_final
map00010	1.1.1.1
map00010	1.1.1.2
map00010	1.1.1.27
map00010	1.1.2.7
map00010	1.1.2.8
map00010	1.2.1.12
map00010	1.2.1.3
map00010	1.2.1.5
map00010	1.2.1.59
map00010	1.2.1.9
...

Now run MinPath1.2.py with the new mapping file using -map ec2kegg_final in the command:

$ MinPath1.2.py -any Sample_1.ec -map ec2kegg_final -report Sample_1.ec.report.kegg -details Sample_1.ec.details.kegg
$ head Sample_1.ec.report.kegg
path 1 any n/a  naive 1  minpath 0  fam0  45  fam-found  3  name  map00010
path 2 any n/a  naive 1  minpath 0  fam0  24  fam-found  1  name  map00020
path 3 any n/a  naive 1  minpath 0  fam0  50  fam-found  1  name  map00030
path 4 any n/a  naive 1  minpath 0  fam0  63  fam-found  1  name  map00040
path 5 any n/a  naive 1  minpath 0  fam0  70  fam-found  2  name  map00051
path 6 any n/a  naive 1  minpath 0  fam0  46  fam-found  1  name  map00052
path 9 any n/a  naive 1  minpath 0  fam0  13  fam-found  1  name  map00062
path 10 any n/a  naive 1  minpath 0  fam0  30  fam-found  2  name  map00071
path 18 any n/a  naive 1  minpath 0  fam0  11  fam-found  2  name  map00190
path 19 any n/a  naive 1  minpath 0  fam0  3  fam-found  1  name  map00195

To get the description of the reported KEGG pathways, use the following one-liner:

$ for I in $(awk '{print $14}' Sample_1.ec.report.kegg); do curl -S http://rest.kegg.jp/find/pathway/$i; done
path:map00010	Glycolysis / Gluconeogenesis
path:map00020	Citrate cycle (TCA cycle)
path:map00030	Pentose phosphate pathway
path:map00040	Pentose and glucuronate interconversions
path:map00051	Fructose and mannose metabolism
path:map00052	Galactose metabolism
path:map00062	Fatty acid elongation
path:map00071	Fatty acid degradation
path:map00190	Oxidative phosphorylation
path:map00195	Photosynthesis
path:map00230	Purine metabolism
path:map00240	Pyrimidine metabolism
path:map00250	Alanine, aspartate and glutamate metabolism
path:map00280	Valine, leucine and isoleucine degradation
path:map00281	Geraniol degradation
path:map00290	Valine, leucine and isoleucine biosynthesis
path:map00310	Lysine degradation
path:map00330	Arginine and proline metabolism
path:map00360	Phenylalanine metabolism
path:map00362	Benzoate degradation
path:map00380	Tryptophan metabolism
path:map00410	beta-Alanine metabolism
path:map00511	Other glycan degradation
path:map00531	Glycosaminoglycan degradation
path:map00592	alpha-Linolenic acid metabolism
path:map00600	Sphingolipid metabolism
path:map00604	Glycosphingolipid biosynthesis - ganglio series
path:map00620	Pyruvate metabolism
path:map01120	Microbial metabolism in diverse environments
...

We want to use interactive Pathways Explorer (iPath), which is a web-based tool for the visualization, analysis and customization of the various pathways maps. The metabolic pathways are constructed using 146 KEGG pathways, and give an overview of the complete metabolism in biological systems. We first extract the pathways in the format to be used with iPath and give each of them red color:

$ awk '{gsub("^map","",$14);print $14" #ff0000"}' Sample_1.ec.report.kegg > Sample_1.ec.report.kegg.ipath

Now go to http://pathways.embl.de/index.html and click on iPath v2 interface to load the applet:

http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/Metaproteomics/ipath.png" style="height:313px; width:600px" />

Next click on the "Customize" button to load the "Customization and data mapping" tab, and upload Sample_1.ec.report.kegg.ipath by clicking on "Select file" button. This will populate the "Element selection" text area. You can manually assign different colors to your pathways here. Finally, click on "Submit data and customize maps" button to generate your metabolic network.

http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/Metaproteomics/dataupload.png" style="height:424px; width:600px" />

Once generated, you can save the metabolic network as a PDF document to your local computer. The connections in red are the ones present in your sample:

http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/Metaproteomics/Sample_1.png" style="height:366px; width:600px" />

Best Wishes,
Umer

view raw biostars-131319.md hosted with ❤ by GitHub

Cleaned up the post and moved to a Gist by Ram on 22-Apr-2022.