I have a multi-sequence FASTA file of the form:
>SDF123.1 blah blah
ATCTCTGGAAACTCGGTGAAAGAGAGTAT
AGTGATGAGGATGAGTGAG...
>SBF123.1 blah blah
ATCTCTGGAAACTCGGTGAAAGAGAGTAT
AGTGATGAGGATGAGTGAG....
I want to split it so that each sequence ends up in its own file (like here).
I wrote the following awk code, but it runs much more slowly than it did before I added the close command. By slow, I mean it generates only about a dozen files per minute. I had to add the close command because, without it, awk failed with the error "too many open files".
Here is the code:
cat big_multi_sequence_file.fasta | awk -F ' ' '{
if (substr($0, 1, 1)==">") {filename=(substr($1,2) ".fa")}
print $0 >> filename; close (filename)
}'
How can I make this code more time efficient? I am new to awk.
Thank you!
If you are willing to try other solutions, then faSplit (LINK for UNIX version) by Jim Kent is probably going to be one of the most efficient options. It is available for Linux/macOS. Be sure to add execute permissions (chmod a+x faSplit).
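If you would rather stay with awk, the slowdown most likely comes from closing and reopening the output file for every single line. A minimal sketch of the usual fix is to close the previous file only when a new header line appears, so each file is opened once per record. This assumes every record starts with a ">" header, and that the sequence IDs are unique and safe to use as file names. The file example_multi.fasta and its contents are made up here so the snippet runs on its own.

```shell
# Tiny example input (hypothetical name and contents, for illustration only)
cat > example_multi.fasta <<'EOF'
>SDF123.1 blah blah
ATCTCTGG
AGTGATGA
>SBF123.1 blah blah
ATCTCTGG
EOF

awk '
/^>/ {
    # New record: close the previous output file (if any), then
    # build the new file name from the ID that follows ">".
    if (filename != "") close(filename)
    filename = substr($1, 2) ".fa"
}
# Write every line, header included, to the current output file.
# Lines that appear before the first header are skipped.
filename != "" { print > filename }
' example_multi.fasta
```

Only one output file is open at any time, so the "too many open files" error should not come back. Note that this sketch uses > rather than >>, so a file that already exists is overwritten on the first write; switch to >> if the same ID can occur more than once in the input and you want those records appended.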