How to safely split a FASTA file containing multiple concatenated sequences
10.5 years ago
akhst7 ▴ 40

Hi

I have a FASTA file that contains multiple concatenated human virus reference sequences from NCBI. Each sequence has the usual NCBI header (starting with 'gi'), as follows:

>gi|109390382|ref|NC_008188.1| Human papillomavirus type 103, complete genome
>gi|109390389|ref|NC_008189.1| Human papillomavirus type 101, complete genome
>gi|110645916|ref|NC_001401.2| Adeno-associated virus - 2, complete genome
>gi|134133206|ref|NC_009225.1| Torque teno midi virus 1, complete genome
>gi|134288556|ref|NC_009238.1| KI polyomavirus Stockholm 60, complete genome
>gi|139424470|ref|NC_009334.1| Human herpesvirus 4, complete genome
>gi|139472801|ref|NC_009333.1| Human herpesvirus 8, complete genome
>gi|148724565|ref|NC_009539.1| WU Polyomavirus, complete genome
>gi|155573622|ref|NC_006273.2| Human herpesvirus 5 strain Merlin, complete genome
>gi|165973999|ref|NC_010277.1| Merkel cell polyomavirus, complete genome
>gi|167600365|ref|NC_010329.1| Human papillomavirus type 88, complete genome

The file is about 3.2 MB, and I'd like to split it into two or more smaller files without cutting a virus sequence in half at the end of a file. Is there an easy or clever way to accomplish this?

Thanks in advance.

genome • 4.0k views

Thanks for the posts. Are there any scripts using sed/awk, even if that may not be a simpler solution than the others?

10.5 years ago
Vivek ★ 2.7k

faSplit, from Jim Kent's UCSC utilities, is a suitable tool for the job.

10.5 years ago

Try pyfasta:

pyfasta split -n 2 original.fasta
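
If you would rather not install anything, the same idea can be sketched in plain Python. This is only a minimal illustration, not what pyfasta does internally: the function names (read_fasta, split_fasta) and the output naming scheme (part.0.fasta, part.1.fasta, ...) are made up for this example. It reads whole records, then always adds the next-largest record to whichever output file is currently smallest, so no sequence is ever cut in half. Note that, as written, it does not preserve the original record order.

```python
from typing import Iterator, List, Tuple


def read_fasta(path: str) -> Iterator[Tuple[str, List[str]]]:
    """Yield (header, sequence_lines) for each record; line endings are stripped."""
    header, lines = None, []
    with open(path) as handle:
        for raw in handle:
            line = raw.rstrip("\r\n")
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    yield header, lines
                header, lines = line, []
            elif line and header is not None:
                lines.append(line)
    if header is not None:
        yield header, lines


def split_fasta(path: str, n_parts: int, out_prefix: str = "part") -> List[str]:
    """Distribute whole records over n_parts files of roughly equal size."""
    records = list(read_fasta(path))
    # Largest records first, so the greedy assignment balances better.
    records.sort(key=lambda rec: sum(len(l) for l in rec[1]), reverse=True)

    bins: List[List[Tuple[str, List[str]]]] = [[] for _ in range(n_parts)]
    sizes = [0] * n_parts
    for header, lines in records:
        smallest = sizes.index(min(sizes))  # file with the least sequence so far
        bins[smallest].append((header, lines))
        sizes[smallest] += sum(len(l) for l in lines)

    out_paths = []
    for i, recs in enumerate(bins):
        out_path = f"{out_prefix}.{i}.fasta"  # hypothetical naming scheme
        with open(out_path, "w") as out:
            for header, lines in recs:
                out.write(header + "\n")
                for line in lines:
                    out.write(line + "\n")
        out_paths.append(out_path)
    return out_paths
```

Calling split_fasta("original.fasta", 2) would then write part.0.fasta and part.1.fasta to the current directory. For a 3.2 MB file, holding all records in memory is fine; for much larger files a streaming approach would be a better fit.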