Question

Parsing gbk file

0


5.4 years ago

erick_rc93 ▴ 30

I have multiple GenBank (.gbk) files. Each one is a concatenation of several chromosomes and plasmids, and I would like to split each into single-record files. I'm trying the following awk code:

awk -v n=1 '/^\/\//{close("out"n);n++;next} {print > "out"n}' filename.gbk

I'd like the output files to be named after the input file:

filename_1.gbk
filename_2.gbk
filename_3.gbk

shell • 1.1k views

updated 5.4 years ago by Pierre Lindenbaum 164k • written 5.4 years ago by erick_rc93 ▴ 30

1


I would strongly suggest using a proper parser such as Biopython for this.

If for some reason you cannot, it should be sufficient to split the files up between the LOCUS and // lines.

5.4 years ago by Joe 21k
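The LOCUS-to-// split suggested in the comment above could be sketched roughly as follows. This is only an illustration, not code from the thread: the function name `split_gbk` is made up, and it assumes every record starts with a `LOCUS` line and ends with a `//` line. Output files are numbered in the order the records appear and take the input file's base name, as asked in the question.

```shell
# Hypothetical helper (not from the thread): split one multi-record GenBank
# file into <basename>_1.gbk, <basename>_2.gbk, ... in the current directory.
# Assumes each record runs from a LOCUS line to a terminating // line.
split_gbk() {
    input=$1
    base=$(basename "$input" .gbk)
    awk -v base="$base" '
        # A LOCUS line starts a new record: pick the next numbered file name.
        /^LOCUS/  { n++; out = sprintf("%s_%d.gbk", base, n) }
        # Copy every line of the current record, including the // terminator.
        out != "" { print > out }
        # The // line ends the record: close the file and wait for the next LOCUS.
        /^\/\//   { if (out != "") close(out); out = "" }
    ' "$input"
}
```

Calling `split_gbk filename.gbk` would then write `filename_1.gbk`, `filename_2.gbk`, and so on. Unlike the one-liner in the question, this keeps the `//` terminator in each output file, which most GenBank parsers expect.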

score 1 · Answer 1 · 2019-07-18

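# Stream a GenBank flat file and start a new output file at every LOCUS line,
# named after the locus ($2); each following line is appended to that file.
wget -O - "ftp://ftp.ncbi.nlm.nih.gov/genbank/gbpln64.seq.gz" | gunzip -c | \
awk 'BEGIN{fname="";} /^LOCUS/ {close(fname);fname=sprintf("%s.gbk",$2);} {if(fname!="") print $0 >> fname; }'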

$ ls *.gbk | head
CR354457.gbk
CR354458.gbk
CR354459.gbk
CR354460.gbk
CR354461.gbk
CR354462.gbk
CR354463.gbk
CR354464.gbk
CR354465.gbk
CR354466.gbk

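EDIT: if you want the input filename in the output name, replace the sprintf call with the one below. This assumes awk is run on the file directly rather than on piped input, since FILENAME is only meaningful when awk reads a named file: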

sprintf("%s.%s.gbk",FILENAME,$2);
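Put together for local files, the idea in this answer might look like the sketch below. The function name `split_by_locus` is hypothetical, and stripping the `.gbk` extension from `FILENAME` is an addition not present in the answer. It assumes locus names are unique within each input file, because it writes with `>` rather than `>>` so that re-running it does not append duplicate records.

```shell
# Hypothetical wrapper (not from the thread): split each named GenBank file
# into one file per record, named <input-without-.gbk>.<locus>.gbk.
# Assumes locus names are unique within each input file.
split_by_locus() {
    awk '
        /^LOCUS/ {
            if (out != "") close(out)      # finish the previous record
            name = FILENAME
            sub(/\.gbk$/, "", name)        # drop the input extension
            out = sprintf("%s.%s.gbk", name, $2)
        }
        out != "" { print > out }          # copy the line to the current record
    ' "$@"
}
```

For example, `split_by_locus genome1.gbk genome2.gbk` would write the outputs next to each input file. Avoid passing `*.gbk` on a second run in the same directory, since the glob would then also match the files produced by the first run.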