Question

Split multifasta file using awk command

5.5 years ago

fec2 ▴ 50

Hi,

I have a multi-FASTA file and need to split it into multiple FASTA files, one gene per file. Following the post Splitting A Fasta File, I tried the command below:

awk -F "|" '/^>/ {close(F) ; F = $1".fasta"} {print >> F}' yourfile.fa

However, every output file name contains the symbol ">", for example ">my_contig_name.fasta".

How can I avoid having ">" in the output file names? Thanks.

sequence • 6.3k views

updated 5.5 years ago by Jean-Karim Heriche 27k • written 5.5 years ago by fec2 ▴ 50

Please use the search function; this has been asked many times before:

Split multifasta file in individual sequence file

How to split a multi fasta file into individual chromosomes

splitting multifasta-file in python

Split the multiple sequences file into a separate files

5.5 years ago by ATpoint 85k

Hi,

Actually, I have tried several commands from these posts, but only the command above works for me. However, it puts ">" at the start of every output file name.

5.5 years ago by fec2 ▴ 50

score 2 · Accepted Answer · 2019-06-06

5.5 years ago

AK ★ 2.2k

Try changing the command to:

awk -F "|" '/^>/ {close(F); ID=$1; gsub("^>", "", ID); F=ID".fasta"} {print >> F}' yourfile.fa
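To see what this command does, here is a minimal sketch run on a tiny made-up file. The file name demo.fa, the IDs gene1 and gene2, and the sequences are all hypothetical; it also assumes that headers look like ">ID|description", so that the text before the first "|" is a usable file name.

```shell
# Work in a scratch directory so the demo does not touch real data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical two-record FASTA; IDs and sequences are invented for illustration.
printf '>gene1|sample description\nACGT\nGGCC\n>gene2|another description\nTTAA\n' > demo.fa

# On each header line: close the previous output file, copy field 1
# (the text before the first "|"), strip the leading ">", and build the file name.
# Every line, header included, is then appended to the current file.
awk -F "|" '/^>/ {close(F); ID=$1; gsub("^>", "", ID); F=ID".fasta"} {print >> F}' demo.fa

# List the resulting files.
ls
```

The header line inside each output file is unchanged; only the file name loses the ">". Because the command appends with >>, running it twice in the same directory writes each record a second time, so it is safer to start from an empty output directory.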

If you are not limited to awk, you can also use seqkit: seqkit split --by-id yourfile.fa

5.5 years ago by AK ★ 2.2k

Thank you very much!

5.5 years ago by fec2 ▴ 50

score 1 · Accepted Answer · 2019-06-06

5.5 years ago

Jean-Karim Heriche 27k

Try:

awk -F "|" '/^>/ {close(F) ; F = substr($1,2,length($1)-1)".fasta"} {print >> F}' yourfile.fa
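Before writing any files, it can help to preview the names that awk would generate. The sketch below is a dry run on made-up headers (gene1 and gene2 are hypothetical); it uses substr($1, 2), which returns everything from the second character onward and so gives the same result as substr($1,2,length($1)-1).

```shell
# Hypothetical headers; only the text before the first "|" is used for the name.
# Print the would-be output file name for each header instead of writing files.
names=$(printf '>gene1|desc\nACGT\n>gene2|desc\nTTAA\n' |
  awk -F "|" '/^>/ {print substr($1, 2) ".fasta"}')
echo "$names"
```

If a header has no "|" but does contain spaces, the whole header line becomes field 1 and ends up in the file name, so checking the preview first is a cheap safeguard.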