FASTA file split
4.7 years ago
priya120195 ▴ 20

Hi, I have a merged FASTA file of 1500 sequences. I want to split it into just 2 files, one with 1000 FASTA sequences and the other with 500, keeping the headers intact. Can anyone suggest a command to do this easily with awk or grep?

alignment sequence • 3.6k views

For simple FASTA format, with each sequence on a single line:

$ head -n 2000 in.fa > first_1k.fa 
$ tail -n 1000 in.fa > last_500.fa

For FASTA with multi-line sequences, use bioawk: https://github.com/lh3/bioawk

$ bioawk -cfastx 'NR<=1000{print ">"$name"\n"$seq}' in.fa > first_1k.fa
$ bioawk -cfastx 'NR>1000{print ">"$name"\n"$seq}' in.fa > last_500.fa
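If bioawk is not installed, plain awk can do the same split in one pass by counting header lines. The following is only a sketch: the function name split_fasta_at and its argument order are made up for illustration, and it assumes every record starts with a line beginning with ">".

```shell
# Hypothetical helper: put the first N records of a FASTA file in one file
# and all remaining records in another. Works for single-line and
# multi-line sequences because records are counted by ">" header lines.
split_fasta_at() {
    input=$1; n=$2; first=$3; rest=$4
    : > "$first"; : > "$rest"                      # start with empty output files
    awk -v n="$n" -v first="$first" -v rest="$rest" '
        /^>/ { count++ }                           # a new record starts at every header
        count == 0 { next }                        # skip anything before the first header
        { print > (count <= n ? first : rest) }    # route the line by record number
    ' "$input"
}

# Example: split_fasta_at in.fa 1000 first_1k.fa last_500.fa
```

Unlike the bioawk commands, this keeps the original line wrapping and the full header line, since lines are copied through unchanged.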

Try the faSplit command from the UCSC utilities.


If each raw sequence is on one and only one line, then each record, including its header, takes exactly two lines, so you can use:

head -n 2000 file.fasta > file1.fasta
tail -n +2001 file.fasta > file2.fasta

The + before 2001 is necessary: it tells tail to output line 2001 and everything after it, rather than the last 2001 lines.
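The head/tail approach gives wrong results if any sequence is wrapped over several lines, so it may be worth checking the layout first. A minimal sketch, assuming no blank lines in the file (the function name is_two_line_fasta is hypothetical):

```shell
# Hypothetical helper: succeed only if the file strictly alternates
# header line, sequence line, header line, sequence line, ...
is_two_line_fasta() {
    awk '
        NR % 2 == 1 && !/^>/ { bad = 1 }          # odd lines must be headers
        NR % 2 == 0 &&  /^>/ { bad = 1 }          # even lines must be sequence
        END { exit (bad || NR == 0 || NR % 2) }   # also reject empty or odd-length files
    ' "$1"
}

# Example: is_two_line_fasta file.fasta && head -n 2000 file.fasta > file1.fasta
```

If the check fails, a method that counts records rather than lines is the safer choice.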


Hi, this is my FASTA file header:

>hCoV-19/Country_name/1-27/2020|EPI_ISL_413522|2020-01-27

I have a merged FASTA file. Can you suggest a script or a way to split it based on Country_name? I want all the sequences from one country in one separate FASTA file, and the same for every other country. Is that possible?


For the original 1000/500 split, try this:

awk '{ if( $0 ~ /^>/){print prevID"\t"seq; prevID=$0; seq=""} else {gsub(/\W/, "", $0) ; seq=seq$0} } END {print prevID"\t"seq}' YOUR_FASTA_FILE | awk 'FNR > 1' | head -n 1000 | sed 's/\t/\n/g' > first1000seqs.fas

For the last 500 sequences, replace head -n 1000 with tail -n 500.


Are all the country names right after the first "/"?


Yes. All country names are after the first "/".


This is a solution using bioawk to process the FASTA file: https://github.com/lh3/bioawk

$ bioawk -cfastx '{split($name, a, "/"); print ">"$name"\n"$seq >a[2]}' in.fa
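The same idea can be sketched in plain awk, assuming (as confirmed above) that the country name is always the text between the first and second "/" of the header. The function name split_fasta_by_country, the ".fasta" suffix and the "unknown" fallback are illustrative choices, not part of the original answer.

```shell
# Hypothetical helper: write each record to OUTDIR/<country>.fasta, where
# <country> is the second "/"-separated field of the header line.
split_fasta_by_country() {
    input=$1; outdir=$2
    mkdir -p "$outdir"
    awk -v outdir="$outdir" '
        /^>/ {
            if (out != "") close(out)              # keep only one output file open
            split($0, parts, "/")                  # parts[2] is the country name
            country = parts[2]
            if (country == "") country = "unknown" # header without a "/"
            gsub(/[^A-Za-z0-9_.-]/, "_", country)  # make the name safe as a filename
            out = outdir "/" country ".fasta"
        }
        out != "" { print >> out }                 # append, since files are reopened
    ' "$input"
}

# Example: split_fasta_by_country in.fa by_country
```

Because the output is appended, run it into a fresh directory; running it twice into the same directory would duplicate every record. Closing each file before switching avoids the open-file limit that some awk implementations hit when there are many countries.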
4.7 years ago

Priya, you can use seqkit

seqkit split2 your_fasta_file.fa -s 1000 -f

-s, --by-size int split sequences into multi parts with N sequences

4.7 years ago
Hugo ▴ 380

You can also use SEDA. To achieve the desired split, you may use the Split operation (under Choose operation / General) and configure Fixed number of sequences per file with 1000 sequences.
