pathway mapping using KEGG
9.2 years ago

I have assigned KEGG IDs to my newly sequenced protein sequences using KEGG/KAAS, so I now have a list of IDs. How do I assign them to pathway maps? I need to know which of the genes (proteins) is in which family.

sequencing • 5.2k views
9.2 years ago
Kamil ★ 2.3k

Could I ask you to provide an example of an input file and an example of your desired output? It might help us to better understand your question.

Perhaps you might find this tool useful? https://github.com/endrebak/kg


The input is a file of more than 5000 protein sequences,

e.g.

>mgg4500002 qor, 1144-2184 (Clockwise) Quinone oxidoreductase
MAASQCKRSCSPMKAITLQTYGGPEVALLRHDAPIPQATPGHVLVKVACAGINFMDVHTR
QGKYAQSVTYPVRLPCTLGMEGAGVVVDVGAGVSHLHVGDRVAWCIAWGAYAEYAAVPAD
KIAQIPSAITFDQAAAAMFQGCTAHYLIDDVARLHVGSTCLVHAASGSIGQLLVQMARRL
GATVFATGSSAEKCAIALQRGAHQAWTYDEGRFAERVREATAGQGVDVVFDSLGKTTLRD
SFRACRTRGLIVNYGNVSGSLTDLDPIELGEAGSLFLTRPRLADHMADGATVQRRANAVF
AAMLEGSLTVEIEGHYSLETVKQVHARIEARQQIGKAVVWVDRDLV
>mgg4500003 BASYS00003, 2160-2531 (Clockwise) Hypothetical Protein BASYS00003
MGGPRLGLMQTKKKPADQAGLGYPANSAGSGVVAVQAISAAFGQATFLQTISTAFSDTVA
IQAISTTFDQATFLQTVSTAFSDTVVIQAIRTTFDQATFLQTVSTAFSDTVAIQAIRTTF
DQA
>mgg4500004 insK, 3371-2562 (CounterClockwise) Putative transposase InsK for insertion sequence element IS150
MRDLLKLVSLARSTYYYQLKAMGVADRLSSIKASIQTIQNEHKGRFGYRRMTLELRKERS
LINGKTVRRLMGELGLKCTVRPKKYRSYKGPMGEVSPNTLARQFEAEQPNQKWVTDVTEF
KVAGKKLYLSPVLDLYNGEIVAYQTAIRPQYALVGEMLEKAIEGLPEGGKPMLHSDQGWH
YRYPKYRERLEKAGLEQSMSRKGNCHDNATMESFFGTLKSEFYYRESFESVEQLQAGLDE
YIHYYNHKRIKVKLGGLSPVAYRTRSAVA

The output should look like this:

Amino acid metabolism

MAP00250 : Alanine, aspartate and glutamate metabolism
MAP00260 : Glycine, serine and threonine metabolism
MAP00270 : Cysteine and methionine metabolism
MAP00280 : Valine, leucine and isoleucine degradation
MAP00290 : Valine, leucine and isoleucine biosynthesis
MAP00300 : Lysine biosynthesis
MAP00310 : Lysine degradation
MAP00330 : Arginine and proline metabolism
MAP00340 : Histidine metabolism
MAP00350 : Tyrosine metabolism
MAP00360 : Phenylalanine metabolism
MAP00380 : Tryptophan metabolism
MAP00400 : Phenylalanine, tyrosine and tryptophan biosynthesis

Biosynthesis of other secondary metabolites

MAP00232 : Caffeine metabolism
MAP00311 : Penicillin and cephalosporin biosynthesis
MAP00401 : Novobiocin biosynthesis
MAP00402 : Benzoxazinoid biosynthesis
MAP00521 : Streptomycin biosynthesis
MAP00524 : Butirosin and neomycin biosynthesis
MAP00940 : Phenylpropanoid biosynthesis
MAP00950 : Isoquinoline alkaloid biosynthesis
MAP00960 : Tropane, piperidine and pyridine alkaloid biosynthesis
MAP00966 : Glucosinolate biosynthesis

All proteins mapped

Links:
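A listing like the one above can be approximated by joining two tables: gene to K number, and K number to pathway map. The sketch below is only one way to do it and rests on assumptions: that the annotation file has one `gene<TAB>K number` pair per line (genes without a hit have no second column), and that the K number to pathway table looks like the output of the KEGG REST request `link/pathway/ko`, with `ko:` and `path:` prefixes. The function names are made up for illustration, and the toy lines at the bottom are not real annotation results. The top-level category headings (e.g. "Amino acid metabolism") are not covered here.

```python
from collections import defaultdict
from urllib.request import urlopen

KEGG_REST = "https://rest.kegg.jp"  # public KEGG REST endpoint


def fetch(path):
    """Download one KEGG REST table as text, e.g. fetch('link/pathway/ko')."""
    with urlopen(f"{KEGG_REST}/{path}") as response:
        return response.read().decode("utf-8")


def strip_prefix(identifier):
    """Turn 'ko:K00001' or 'path:map00010' into 'K00001' or 'map00010'."""
    return identifier.split(":", 1)[-1]


def parse_gene_ko(text):
    """Parse 'gene<TAB>K number' lines; genes without a K number are skipped."""
    gene_to_kos = defaultdict(set)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            gene_to_kos[fields[0]].add(fields[1])
    return dict(gene_to_kos)


def parse_ko_pathway(text):
    """Parse 'ko:Kxxxxx<TAB>path:mapNNNNN' lines, keeping only the map IDs."""
    ko_to_maps = defaultdict(set)
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 2:
            continue
        ko, pathway = (strip_prefix(f.strip()) for f in fields)
        if pathway.startswith("map"):  # ignore the parallel 'koNNNNN' entries
            ko_to_maps[ko].add(pathway)
    return dict(ko_to_maps)


def genes_by_pathway(gene_to_kos, ko_to_maps):
    """Return {pathway map ID: sorted list of genes assigned to it}."""
    pathway_to_genes = defaultdict(set)
    for gene, kos in gene_to_kos.items():
        for ko in kos:
            for pathway in ko_to_maps.get(ko, ()):
                pathway_to_genes[pathway].add(gene)
    return {p: sorted(g) for p, g in sorted(pathway_to_genes.items())}


# Toy input in the assumed layouts (placeholders, not real results).
toy_gene_ko = "geneA\tK00001\ngeneB\ngeneC\tK00002\n"
toy_ko_pathway = (
    "ko:K00001\tpath:map00010\n"
    "ko:K00001\tpath:ko00010\n"
    "ko:K00002\tpath:map00010\n"
    "ko:K00002\tpath:map00040\n"
)
toy_result = genes_by_pathway(
    parse_gene_ko(toy_gene_ko), parse_ko_pathway(toy_ko_pathway)
)
```

With real data you would replace `toy_ko_pathway` with `fetch("link/pathway/ko")` and read `toy_gene_ko` from your own annotation file; pathway names could be joined in the same way from `fetch("list/pathway")`.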

7.8 years ago
biomonte ▴ 220

I solved this by using GhostKOALA.

You just need to provide your query amino acid sequences in FASTA format and specify which KEGG GENES database file to search. You will get an email when your results are ready. In the results, if you go to "reconstruct pathway", it will tell you how many proteins match each family and also which of the genes are in each family. Hope it helps!


How long does it usually take GhostKOALA to run a ~5 MB AA FASTA file? Cheers


I would say probably less than 1 hour. I tried with a 15 MB AA FASTA file and it took about 3 hours.


Thanks, I uploaded a 1.3 MB AA FASTA file and it took 22 hours. I guess the server is busy at the moment?

cheers

Alan


Do the FASTA-formatted amino acid sequences have to be divided into proteins, like this:

>PROKKA_00002 hypothetical protein
MSINSSLQQLAGGIAAAIGGMIVVQKDNFSPIEHYDTLALVVAIFVGICVYVLSLVSKIV
RDKNKA*
>PROKKA_00003 ATP-dependent RNA helicase RhlE
LEALNRFKAGKTRVLVTTDLLARGIDIQFLPFVINYELPRSPKDYIHRIGRTVRAEASGE
AISFVSPEDQHHFKVIQKKMKKWVTMVEGDGLV*
>PROKKA_00004 Long-chain-fatty-acid--CoA ligase FadD13
MIIRGGENIYSSEVENILYEHPAVTDAALVGIPHQTLGEEPAAVVHLAPGMTATEEELRH
YVSERLAKFKVPVKIIFTQDTLPRNANGKILKRDLKALF*

I mean, they have to be, right? Otherwise, how would the program tell where one protein ends and the next one begins?


Welcome @willnotburn: As far as I am aware, partial protein sequences can be used as input too. This means that you can input sequences that either do not start with M (5prime_partial) or do not end with * (3prime_partial). I did not have any problem with internal protein sequences either.
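To make the naming above concrete, here is a minimal sketch that labels a protein string by the two rules just described (starts with M, ends with *). The function name is hypothetical and this is only a heuristic: many protein FASTA files leave out the `*` altogether, in which case a missing `*` says nothing about whether the sequence is truncated.

```python
def classify_protein(seq):
    """Label a protein string by which of its ends look complete."""
    seq = seq.strip()
    has_start = seq.upper().startswith("M")  # begins with methionine
    has_stop = seq.endswith("*")             # stop codon written as '*'
    if has_start and has_stop:
        return "complete"
    if has_stop:
        return "5prime_partial"   # start is missing
    if has_start:
        return "3prime_partial"   # stop is missing
    return "internal"             # both ends are missing


# Sequences taken from the examples earlier in this thread.
labels = [
    classify_protein("MSINSSLQQLAGGIAAAIGGMIVVQKDNFSPIEHYDTLALVVAIFVGICVYVLSLVSKIVRDKNKA*"),
    classify_protein("LEALNRFKAGKTRVLVTTDLLARGIDIQFLPFVINYELPRSPKDYIHRIGRTVRAEASGEAISFVSPEDQHHFKVIQKKMKKWVTMVEGDGLV*"),
]
```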


Thanks, Santiago! Partial protein sequence support definitely helps. But just so I get it clearly: each (full or partial) sequence has to have its own FASTA header >, followed by the sequence on the next line. Is that right?


Yep, it's just a regular FASTA formatted file :-)
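For anyone who wants to check a file before uploading, a small reader like the one below shows how the records are delimited: each `>` line starts a new record, and every following line up to the next `>` belongs to that sequence, so a sequence may wrap over several lines. This is a plain-Python sketch with a made-up function name, not anything GhostKOALA provides.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # close the previous record
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)  # wrapped sequence lines are joined
    if header is not None:
        yield header, "".join(chunks)  # close the last record


# Two records copied from the example earlier in this thread.
example = """>PROKKA_00002 hypothetical protein
MSINSSLQQLAGGIAAAIGGMIVVQKDNFSPIEHYDTLALVVAIFVGICVYVLSLVSKIV
RDKNKA*
>PROKKA_00003 ATP-dependent RNA helicase RhlE
LEALNRFKAGKTRVLVTTDLLARGIDIQFLPFVINYELPRSPKDYIHRIGRTVRAEASGE
AISFVSPEDQHHFKVIQKKMKKWVTMVEGDGLV*
"""
records = list(read_fasta(example))
```

Counting `len(records)` and comparing it with the number of proteins you expect is a quick sanity check before submitting.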
