KEGG data parse
7.2 years ago

Hello, I have a file like this:

ENTRY       EC 1.1.1.1                  Enzyme
NAME        alcohol dehydrogenase;
CLASS       Oxidoreductases;
SYSNAME     alcohol:NAD+ oxidoreductase
REACTION    (1) a primary alcohol + NAD+ = an aldehyde + NADH + H+ [RN:R00623];
ALL_REAC    R00623 > R00754 R02124 R02878 R04805 R04880 R05233 R05234 R06917 R06927 R08281 R08306 R08557 R08558 R10783;
SUBSTRATE   primary alcohol [CPD:C00226];
PRODUCT     aldehyde [CPD:C00071];
ENTRY       EC 1.1.1.157                Enzyme
NAME        3-hydroxybutyryl-CoA dehydrogenase;
CLASS       Oxidoreductases;
SYSNAME     (S)-3-hydroxybutanoyl-CoA:NADP+ oxidoreductase
REACTION    (S)-3-hydroxybutanoyl-CoA + NADP+ = 3-acetoacetyl-CoA + NADPH + H+ [RN:R01976]
ALL_REAC    R01976;
SUBSTRATE   (S)-3-hydroxybutanoyl-CoA [CPD:C01144];
PRODUCT     3-acetoacetyl-CoA [CPD:C00332];

and I need to convert it to a table with the header

ENTRY NAME CLASS SYSNAME REACTION ALL_REAC SUBSTRATE PRODUCT

with the corresponding values in rows, one row per entry. Can anybody help me write a script for this?

R data parsing • 2.3k views
$ awk -F "     "  'FNR<9 {sub(" ","\t");gsub(";","");print $1,$2}' test | datamash transpose --no-strict | tr -d " " > out.txt

output (tab separated):

 $ cat out.txt 
ENTRY   NAME    CLASS   SYSNAME REACTION    ALL_REAC    SUBSTRATE   PRODUCT
EC1.1.1.1   alcoholdehydrogenase    Oxidoreductases alcohol:NAD+oxidoreductase  (1)aprimaryalcohol+NAD+=analdehyde+NADH+H+[RN:R00623]   R00623>R00754R02124R02878R04805R04880R05233R05234R06917R06927R08281R08306R08557R08558R10783 primaryalcohol[CPD:C00226]  aldehyde[CPD:C00071]

This command gives the correct output for the first entry only. Could you please adapt it to work on the entire file? I am not proficient in awk.


input:

$ cat test
ENTRY       EC 1.1.1.1                  Enzyme
NAME        alcohol dehydrogenase;
CLASS       Oxidoreductases;
SYSNAME     alcohol:NAD+ oxidoreductase
REACTION    (1) a primary alcohol + NAD+ = an aldehyde + NADH + H+ [RN:R00623];
ALL_REAC    R00623 > R00754 R02124 R02878 R04805 R04880 R05233 R05234 R06917 R06927 R08281 R08306 R08557 R08558 R10783;
SUBSTRATE   primary alcohol [CPD:C00226];
PRODUCT     aldehyde [CPD:C00071];
ENTRY       EC 1.1.1.157                Enzyme
NAME        3-hydroxybutyryl-CoA dehydrogenase;
CLASS       Oxidoreductases;
SYSNAME     (S)-3-hydroxybutanoyl-CoA:NADP+ oxidoreductase
REACTION    (S)-3-hydroxybutanoyl-CoA + NADP+ = 3-acetoacetyl-CoA + NADPH + H+ [RN:R01976]
ALL_REAC    R01976;
SUBSTRATE   (S)-3-hydroxybutanoyl-CoA [CPD:C01144];
PRODUCT     3-acetoacetyl-CoA [CPD:C00332];

command:

 $ sed 's/\s\+/\t /;s/.*ENT/\n&/g;s/          /\t/g' test | cut -f1,2 | mlr --ixtab --omd cat | sed '2d;s/| //;s/\s*|\s*/\t/g'

output:

ENTRY   NAME    CLASS   SYSNAME REACTION    ALL_REAC    SUBSTRATE   PRODUCT 
EC 1.1.1.1  alcohol dehydrogenase;  Oxidoreductases;    alcohol:NAD+ oxidoreductase (1) a primary alcohol + NAD+ = an aldehyde + NADH + H+ [RN:R00623]; R00623 > R00754 R02124 R02878 R04805 R04880 R05233 R05234 R06917 R06927 R08281 R08306 R08557 R08558 R10783; primary alcohol [CPD:C00226];   aldehyde [CPD:C00071];  
EC 1.1.1.157    3-hydroxybutyryl-CoA dehydrogenase; Oxidoreductases;    (S)-3-hydroxybutanoyl-CoA:NADP+ oxidoreductase  (S)-3-hydroxybutanoyl-CoA + NADP+ = 3-acetoacetyl-CoA + NADPH + H+ [RN:R01976]  R01976; (S)-3-hydroxybutanoyl-CoA [CPD:C01144]; 3-acetoacetyl-CoA [CPD:C00332];

Miller (mlr) can be installed from the Ubuntu (up to Xenial 16.04) and Mint (Sonya 18.2) repositories. However, you will need the latest version of Miller, so compile it from the Miller GitHub repository.

7.2 years ago
Paul ★ 1.5k

Hi, this solution works on your example data. I drop the first column (the field names) and replace spaces with commas, then use the tr and paste commands to put each entry on one line, and finally add the header you asked for. It only works as long as every entry has the same number of rows (eight in your example).

Please test it.

awk -v OFS="," '$1=$1' INPUT | awk -F"," '{for( i=2; i<=NF; i++ ){printf( "%s ", $i )}; printf( "\n"); }'  | tr " " "," | paste - - - - - - - -  | awk -v OFS="\t" 'BEGIN{print "ENTRY","NAME","CLASS","SYSNAME","REACTION","ALL_REAC","SUBSTRATE","PRODUCT"}1'
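
If some entries are missing a field, grouping by a fixed count of eight lines will shift the columns. A minimal sketch of a record-based alternative follows, assuming every record starts with an ENTRY line and every field fits on one line. The function name kegg_to_tsv is made up for illustration; it keeps the spaces inside values, strips the trailing ";" and leaves a column empty when its field is absent.

```shell
# kegg_to_tsv: hypothetical helper name, not part of any KEGG tool.
# Prints a header plus one tab-separated row per ENTRY record.
kegg_to_tsv() {
  awk '
    BEGIN {
      OFS = "\t"
      n = split("ENTRY NAME CLASS SYSNAME REACTION ALL_REAC SUBSTRATE PRODUCT", cols, " ")
      for (i = 1; i <= n; i++) printf "%s%s", cols[i], (i < n ? OFS : "\n")
    }
    # Print the record collected so far, then reset it.
    function flush_row(   i) {
      if (!have) return
      for (i = 1; i <= n; i++) printf "%s%s", rec[cols[i]], (i < n ? OFS : "\n")
      split("", rec)                        # portable way to empty an array
      have = 0
    }
    /^[A-Z_]+[ \t]/ {
      key = $1
      if (key == "ENTRY") flush_row()       # a new ENTRY closes the previous record
      val = $0
      sub(/^[A-Z_]+[ \t]+/, "", val)        # drop the keyword and its padding
      sub(/[ \t]*;?[ \t]*$/, "", val)       # drop a trailing ";" and whitespace
      if (key == "ENTRY") sub(/[ \t]+Enzyme$/, "", val)
      rec[key] = val
      have = 1
    }
    END { flush_row() }
  ' "$@"
}
```

Usage would be something like: kegg_to_tsv INPUT > out.tsv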

What should I do if I want to replace a run of multiple spaces with one comma, rather than replacing every single space with a comma?


Hi, try sed: sed 's/ \{1,\}/,/g' file or, if you prefer tr: tr -s ' ' < file | tr ' ' ',' . And what about my script? Does it work for you?


Yes, I tried it. It works well. Thank you so much.


But what should I do if the text of a field spans multiple lines?


Could you please copy/paste a larger example of your text? I'll look at it :-)
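
In the meantime, one possible approach, sketched under the assumption that the multi-line text follows the usual KEGG flat-file layout: continuation lines begin with whitespace instead of a keyword, and records may end with a "///" line. The hypothetical filter below (the name kegg_unfold is made up) joins each continuation line onto the preceding keyword line, so every field is back on a single line.

```shell
# kegg_unfold: hypothetical helper name, not part of any KEGG tool.
# Joins indented continuation lines onto the preceding keyword line.
kegg_unfold() {
  awk '
    /^\/\/\// { next }                  # skip the "///" record terminator, if present
    /^[ \t]*$/ { next }                 # skip blank lines
    have && /^[ \t]+[^ \t]/ {           # continuation line: indented and non-empty
      sub(/^[ \t]+/, "")                # drop the indentation
      buf = buf " " $0                  # append to the pending keyword line
      next
    }
    {
      if (have) print buf               # a new keyword line closes the pending one
      buf = $0
      have = 1
    }
    END { if (have) print buf }
  ' "$@"
}
```

The output has one line per field again, for example: kegg_unfold INPUT > unfolded.txt, and unfolded.txt can then be used as the input to the command above.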
