Question

Scripting solution to generate a list of KEGG ORTHOLOGY (KO) terms from a tab-delimited annotation file

0

7.2 years ago

jvire1 ▴ 10

Does anyone know a basic scripting approach (perhaps awk or Python) for extracting KEGG Orthology (KO) terms from a tab-delimited annotation file?

The file in question has rows that look like this:

TRINITY_DN18877_c0_g1_i1    KEGG:zma:103654828`KEGG:zma:103654829`KEGG:zma:542341`KO:K02995
TRINITY_DN6301_c0_g1_i1     KEGG:zma:103647201`KO:K10798
TRINITY_DN12892_c3_g5_i1    KEGG:zma:103643875
TRINITY_DN13158_c1_g2_i35   KEGG:vvi:100249085`KO:K02435

What I ultimately need is the transcript ID from column one and the KO terms from column two, like this:

TRINITY_DN6301_c0_g1_i1     K10798

The end goal is to use the list with KEGG Mapper (http://www.kegg.jp/kegg/tool/map_pathway.html) to see what KEGG pathways are present and most abundant in my transcriptome assembly.

RNA-Seq • 2.3k views

Updated 7.2 years ago by Sparrow_kop ▴ 260 • written 7.2 years ago by jvire1 ▴ 10

score 2 · Answer 1 · 2017-09-12

2

7.2 years ago

Pierre Lindenbaum 164k

awk '{n=split($2,a,/`/);for(i=1;i<=n;++i) if(substr(a[i],1,3)=="KO:") printf("%s %s\n",$1,substr(a[i],4));}' input.txt
TRINITY_DN18877_c0_g1_i1 K02995
TRINITY_DN6301_c0_g1_i1 K10798
TRINITY_DN13158_c1_g2_i35 K02435

0

Thank you! Worked like a charm.

-James

Reply • 7.2 years ago by jvire1 ▴ 10

score 0 · Answer 2 · 2017-09-12

0

7.2 years ago

Sparrow_kop ▴ 260

In Python, assuming the delimiter is a tab:

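# Print the transcript ID and each KO term; assumes tab-delimited columns.
with open('your_file', 'r') as f:
    for line in f:
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 2:
            continue
        # Column two holds backtick-separated entries such as KO:K02995.
        for entry in fields[1].split('`'):
            if entry.startswith('KO:'):
                print(fields[0] + '\t' + entry[3:])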
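
Since the end goal is to see which KO terms are most abundant, here is a rough follow-up sketch that collects the (transcript ID, KO term) pairs and tallies how often each KO term occurs. The function name `extract_ko_pairs` and the `sample` list are made up for illustration. The sample rows are the ones from the question, and I am assuming the columns are separated by tabs or spaces and that every KO entry starts with `KO:`.

```python
from collections import Counter


def extract_ko_pairs(lines):
    """Yield (transcript_id, ko_term) for every KO: entry in column two."""
    for line in lines:
        # Assumption: columns are separated by tabs or runs of spaces.
        fields = line.split()
        if len(fields) < 2:
            continue
        # Column two is a backtick-separated list of annotations.
        for entry in fields[1].split('`'):
            if entry.startswith('KO:'):
                yield fields[0], entry[3:]


# Rows copied from the question (hypothetical stand-in for the real file).
sample = [
    "TRINITY_DN18877_c0_g1_i1\tKEGG:zma:103654828`KEGG:zma:103654829`KEGG:zma:542341`KO:K02995",
    "TRINITY_DN6301_c0_g1_i1\tKEGG:zma:103647201`KO:K10798",
    "TRINITY_DN12892_c3_g5_i1\tKEGG:zma:103643875",
    "TRINITY_DN13158_c1_g2_i35\tKEGG:vvi:100249085`KO:K02435",
]

pairs = list(extract_ko_pairs(sample))

# Count how many transcripts carry each KO term, most frequent first.
ko_counts = Counter(ko for _, ko in pairs)
for ko, count in ko_counts.most_common():
    print(ko + '\t' + str(count))
```

To run it on a real file, pass the open file handle instead of `sample`, e.g. `extract_ko_pairs(open('your_file'))`. Rows without a KO entry (like TRINITY_DN12892_c3_g5_i1 above) are simply skipped.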
