10.5 years ago
akhst7
Hi
I have a FASTA file containing multiple concatenated human virus reference sequences from NCBI. Each sequence has the usual NCBI header (starting with 'gi'), as follows:
>gi|109390382|ref|NC_008188.1| Human papillomavirus type 103, complete genome
>gi|109390389|ref|NC_008189.1| Human papillomavirus type 101, complete genome
>gi|110645916|ref|NC_001401.2| Adeno-associated virus - 2, complete genome
>gi|134133206|ref|NC_009225.1| Torque teno midi virus 1, complete genome
>gi|134288556|ref|NC_009238.1| KI polyomavirus Stockholm 60, complete genome
>gi|139424470|ref|NC_009334.1| Human herpesvirus 4, complete genome
>gi|139472801|ref|NC_009333.1| Human herpesvirus 8, complete genome
>gi|148724565|ref|NC_009539.1| WU Polyomavirus, complete genome
>gi|155573622|ref|NC_006273.2| Human herpesvirus 5 strain Merlin, complete genome
>gi|165973999|ref|NC_010277.1| Merkel cell polyomavirus, complete genome
>gi|167600365|ref|NC_010329.1| Human papillomavirus type 88, complete genome
The file is about 3.2 MB, and I'd like to split it into two or more smaller files without cutting a virus sequence in half at the end of a file. Is there an easy or clever way to accomplish this?
Thanks in advance.
Thanks for the posts. Are there any scripts using sed/awk, even if that may not be a simpler solution than the others?
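For illustration, here is a minimal Python sketch (not sed/awk) of the splitting logic described above: a new output file is only started at a `>` header line, so no record is ever cut in two. The function names `chunk_fasta` and `split_fasta`, the `records_per_file` parameter, and the `chunk_001.fa` naming scheme are all assumptions made for this example, and it splits by record count rather than by file size.

```python
def chunk_fasta(lines, records_per_file):
    """Yield lists of lines; each list holds up to records_per_file whole records."""
    chunk, count = [], 0
    for line in lines:
        if line.startswith(">"):
            # A header marks a record boundary: the only safe place to cut.
            if count == records_per_file:
                yield chunk
                chunk, count = [], 0
            count += 1
        chunk.append(line)
    if chunk:
        yield chunk


def split_fasta(path, records_per_file, prefix="chunk"):
    """Write each chunk to prefix_001.fa, prefix_002.fa, ... and return the names.

    The output naming scheme is hypothetical; adjust as needed.
    """
    names = []
    with open(path) as handle:
        for i, chunk in enumerate(chunk_fasta(handle, records_per_file), start=1):
            name = f"{prefix}_{i:03d}.fa"
            with open(name, "w") as out:
                out.writelines(chunk)
            names.append(name)
    return names
```

Assuming the input were saved as `viruses.fa` (a placeholder name), `split_fasta("viruses.fa", 6)` would write six records per output file. Because genome lengths vary a lot (herpesviruses are far longer than papillomaviruses), the resulting files will not be equal in size; balancing by sequence length would need a running byte count instead of a record count.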