Question

FASTA file split

4.8 years ago

priya120195 ▴ 20

Hi, I have a merged FASTA file of 1500 sequences. I want to split it into just 2 files, one with 1000 sequences and the other with 500, with the headers intact. Can anyone suggest a command to do this easily with awk or grep?

alignment sequence • 3.7k views

ADD COMMENT • link updated 4.8 years ago by Hugo ▴ 380 • written 4.8 years ago by priya120195 ▴ 20

For simple FASTA format, with each sequence on a single line:

$ head -n 2000 in.fa > first_1k.fa 
$ tail -n 1000 in.fa > last_500.fa

For FASTA with sequences wrapped over multiple lines, use bioawk (https://github.com/lh3/bioawk). Note that $name holds the header only up to the first whitespace:

$ bioawk -cfastx 'NR<=1000{print ">"$name"\n"$seq}' in.fa > first_1k.fa
$ bioawk -cfastx 'NR>1000{print ">"$name"\n"$seq}' in.fa > last_500.fa
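
Since the question asked for awk, a minimal sketch with plain awk is to count header lines and switch output files once the first N records have been written. This also works when sequences are wrapped over several lines. The function name split_fasta_at and the output file names are illustrative assumptions, not part of any tool:

```shell
# split_fasta_at IN N OUT1 OUT2   (hypothetical helper name)
# Writes the first N records of IN to OUT1 and all remaining records to OUT2.
split_fasta_at() {
  awk -v n="$2" -v first="$3" -v rest="$4" '
    /^>/ { count++ }                           # every header line starts a new record
    { print > (count <= n ? first : rest) }    # parentheses keep the redirection target unambiguous
  ' "$1"
}
```

For the case in the question, this would be called as: split_fasta_at in.fa 1000 first_1k.fa last_500.fa. If the input holds N records or fewer, the second file is simply not created.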

ADD REPLY • link 4.8 years ago by wm ▴ 570

Try the faSplit command from the UCSC utilities.

ADD REPLY • link 4.8 years ago by Chirag Parsania ★ 2.0k

If each sequence is on one and only one line, then each record, including its header, takes two lines, so you can use:

head -n 2000 file.fasta > file1.fasta
tail -n +2001 file.fasta > file2.fasta

The + before 2001 is necessary: it makes tail start at line 2001 and print everything from there on, instead of printing the last 2001 lines.
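
Because the head/tail approach is only correct when every record is exactly two lines, it may be worth checking that first. The following is a rough sketch under that assumption; the function name is_two_line_fasta is made up for illustration:

```shell
# is_two_line_fasta FILE   (hypothetical helper name)
# Succeeds only if FILE strictly alternates header line, sequence line.
is_two_line_fasta() {
  awk '
    NR % 2 == 1 && !/^>/ { bad = 1 }   # odd-numbered lines must be headers
    NR % 2 == 0 &&  /^>/ { bad = 1 }   # even-numbered lines must be sequence
    END { exit (bad || NR % 2) }       # an odd total line count also fails
  ' "$1"
}
```

It could be combined with the commands above, for example: is_two_line_fasta file.fasta && head -n 2000 file.fasta > file1.fasta. A trailing blank line in the file makes the check fail, which is deliberately strict.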

ADD REPLY • link 4.8 years ago by Fatima ▴ 1000

Hi, this is my FASTA file header:

>hCoV-19/Country_name/1-27/2020|EPI_ISL_413522|2020-01-27

I have a merged FASTA file. Can you suggest a script or a way to split it based on Country_name? I want all the sequences from one country in one separate FASTA file, and likewise for the other countries. Is that possible?

ADD REPLY • link updated 4.8 years ago by GenoMax 148k • written 4.8 years ago by priya120195 ▴ 20

Try this:

awk '{ if ($0 ~ /^>/) { print prevID "\t" seq; prevID = $0; seq = "" } else { gsub(/[ \t\r]/, "", $0); seq = seq $0 } } END { print prevID "\t" seq }' YOUR_FASTA_FILE | awk 'FNR > 1' | head -n 1000 | tr '\t' '\n' > first1000seqs.fas

For the last 500 sequences, replace head -n 1000 with tail -n 500.

ADD REPLY • link 4.8 years ago by K.Gee ▴ 40

Are all the country names right after the first "/"?

ADD REPLY • link 4.8 years ago by K.Gee ▴ 40

Yes. All country names come after the first "/".

ADD REPLY • link 4.8 years ago by priya120195 ▴ 20

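Here is a solution using bioawk (https://github.com/lh3/bioawk) to process the FASTA file. It writes each record to a file named after the second "/"-separated field of the header, which is the country name: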

$ bioawk -cfastx '{split($name, a, "/"); print ">"$name"\n"$seq >a[2]}' in.fa
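
If bioawk is not available, the same idea can be sketched in plain awk by splitting each header on "/" and sending every following line to a per-country file. The function name split_fasta_by_country, the .fa suffix, the "unknown" fallback, and the replacement of unusual characters with "_" are all assumptions made for this illustration:

```shell
# split_fasta_by_country IN OUTDIR   (hypothetical helper name)
# Writes one FASTA file per country into OUTDIR, named after header field 2.
split_fasta_by_country() {
  mkdir -p "$2" &&
  awk -F'/' -v dir="$2" '
    /^>/ {
      name = $2                                   # text between the 1st and 2nd "/"
      gsub(/[^A-Za-z0-9_.-]/, "_", name)          # make the name safe as a file name
      if (name == "") name = "unknown"            # header without a country field
      out = dir "/" name ".fa"
    }
    out != "" { print > out }                     # header and sequence lines go to the current file
  ' "$1"
}
```

It might be run as: split_fasta_by_country in.fa by_country. All output files stay open until awk exits, so an input with very many distinct countries could hit the open-file limit of some awk implementations.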

ADD REPLY • link 4.8 years ago by wm ▴ 570

score 2 · Answer 1 · 2020-03-29

4.8 years ago

lakhujanivijay 5.9k

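Priya, you can use seqkit:

seqkit split2 your_fasta_file.fa -s 1000 -f

-s, --by-size int    split sequences into multi parts with N sequences

With 1500 sequences, -s 1000 gives two parts: the first with 1000 sequences and the second with the remaining 500.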

ADD COMMENT • link 4.8 years ago by lakhujanivijay 5.9k

score 0 · Answer 2 · 2020-04-09

4.8 years ago

Hugo ▴ 380

You can also use SEDA. To get the desired split, use the Split operation (under Choose operation / General) and set Fixed number of sequences per file to 1000.

ADD COMMENT • link 4.8 years ago by Hugo ▴ 380