I am trying to split a multifasta file into several smaller multifasta files.
This should be done according to a string in the header of each sequence.
As my sequences come from an annotated metagenome, the headers are slightly different than they would normally be (see below).
Though I have seen several examples of a multifasta file being split into single fasta files, one for each entry in that "multi-file", I could not find a nice solution for my problem.
Specifically, I'm trying to split a fasta file that looks something like this:
>EL0073as #10 ID=0.1;partial=00;size_aa=7890;start_type=none
TAGGCAGGCGTGGGGGTTTGT....
>EL00734864845r #570 ID=0.8;partial=01;size_aa=7890;start_type=none
CCTCTTCGGCCCTCA...
>EL0679495 #900 ID=0.9;partial=10;size_aa=7890;start_type=none
CAAGGACCGTTAGGGGC...
>EL0305fe #101 ID=0.4;partial=00;size_aa=7890;start_type=none
GCTGACGGCAACGTTAG...
And I want to have two files like this:
File 1: non_partial (partial=00)
>EL0073as #10 ID=0.1;partial=00;size_aa=7890;start_type=none
TAGGCAGGCGTGGGGGTTTGT....
>EL0305fe #101 ID=0.4;partial=00;size_aa=7890;start_type=none
GCTGACGGCAACGTTAG...
File 2: partial (partial=10, partial=01, partial=11)
>EL00734864845r #570 ID=0.8;partial=01;size_aa=7890;start_type=none
CCTCTTCGGCCCTCA...
>EL0679495 #900 ID=0.9;partial=10;size_aa=7890;start_type=none
CAAGGACCGTTAGGGGC...
My approach: if a header contains partial=00, I copy everything from that line's ">" (the starting character) up to the next ">" into a new file called "non_partial_sequences.fasta". The initial file is then left with only the sequences that carry partial=10, partial=01 or partial=11 in the header, and I would then rename it to "partial_sequences.fasta".
But this seems too complicated.
Hoping for good suggestions. Thanks in advance.
Far from being too complicated, this is the simplest and most efficient approach, and is how I would probably do it.
Are we to assume the sequences are in no particular order, and that the partial=## string is the only differentiator?
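If so, here is a minimal sketch in Python of the header-based split. It is a slight variation on the approach described in the question: rather than cutting records out of the original file and renaming what is left, it reads the input once and writes each record to one of two new files, leaving the input untouched. It assumes every header carries a two-digit partial=NN field as in the example, and that sequences may span several lines. The function name split_by_partial and the counts it returns are my own choices for illustration.

```python
import re

# Matches the "partial=NN" field of a header line. The two-digit format is
# an assumption taken from the example headers in the question.
PARTIAL_RE = re.compile(r"partial=(\d{2})")


def split_by_partial(in_path, non_partial_path, partial_path):
    """Copy partial=00 records to non_partial_path, all other records to partial_path.

    Returns the number of records written to each output file.
    """
    counts = {"non_partial": 0, "partial": 0}
    with open(in_path) as src, \
         open(non_partial_path, "w") as non_partial_out, \
         open(partial_path, "w") as partial_out:
        out = None  # output file of the record currently being copied
        for line in src:
            if line.startswith(">"):
                match = PARTIAL_RE.search(line)
                if match is None:
                    # Assumption: a header without the field is an error,
                    # not something to sort silently into either file.
                    raise ValueError("no partial= field in header: " + line.strip())
                if match.group(1) == "00":
                    out = non_partial_out
                    counts["non_partial"] += 1
                else:
                    out = partial_out
                    counts["partial"] += 1
            # Sequence lines follow their header into the same file; anything
            # that appears before the first header is skipped.
            if out is not None:
                out.write(line)
    return counts
```

Calling split_by_partial("input.fasta", "non_partial_sequences.fasta", "partial_sequences.fasta") would then produce the two files named in the question; "input.fasta" is a placeholder for your own file name.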