Question

How to safely split a FASTA file containing multiple concatenated sequences

2

11.2 years ago

akhst7 ▴ 40

Hi

I have a FASTA file containing multiple concatenated human virus reference sequences from NCBI. Each sequence has the usual NCBI header (starting with 'gi'), as follows:

>gi|109390382|ref|NC_008188.1| Human papillomavirus type 103, complete genome
>gi|109390389|ref|NC_008189.1| Human papillomavirus type 101, complete genome
>gi|110645916|ref|NC_001401.2| Adeno-associated virus - 2, complete genome
>gi|134133206|ref|NC_009225.1| Torque teno midi virus 1, complete genome
>gi|134288556|ref|NC_009238.1| KI polyomavirus Stockholm 60, complete genome
>gi|139424470|ref|NC_009334.1| Human herpesvirus 4, complete genome
>gi|139472801|ref|NC_009333.1| Human herpesvirus 8, complete genome
>gi|148724565|ref|NC_009539.1| WU Polyomavirus, complete genome
>gi|155573622|ref|NC_006273.2| Human herpesvirus 5 strain Merlin, complete genome
>gi|165973999|ref|NC_010277.1| Merkel cell polyomavirus, complete genome
>gi|167600365|ref|NC_010329.1| Human papillomavirus type 88, complete genome

The file is about 3.2 MB, and I'd like to split it into two or more smaller files without cutting a virus sequence in two at the end of a file. Is there an easy or clever way to accomplish this?

Thanks in advance.

genome • 4.4k views

0

Thanks for the posts. Are there any scripts using sed/awk? Might that not be a simpler solution than the others?

written 11.2 years ago by akhst7 ▴ 40
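
On the sed/awk question, here is a minimal awk sketch, not something posted in this thread. It assumes every record starts with a '>' header line, and it deals records out round-robin, so each output file only ever receives whole sequences. The function name split_fasta and the output prefix are made up for illustration.

```shell
#!/bin/sh
# Split a multi-FASTA file into N parts without cutting any record in two.
# Assumption: every record begins with a header line starting with '>'.
# split_fasta and the output prefix are hypothetical names for this sketch.
split_fasta() {
    input=$1    # path to the multi-FASTA file
    parts=$2    # number of output files wanted
    prefix=$3   # output prefix, e.g. "part" gives part_1.fasta, part_2.fasta
    awk -v n="$parts" -v prefix="$prefix" '
        # On each header, pick the next output file in round-robin order.
        /^>/ { out = sprintf("%s_%d.fasta", prefix, (count++ % n) + 1) }
        # Write the header and its sequence lines to the chosen file;
        # any lines before the first header are skipped.
        out != "" { print > out }
    ' "$input"
}
```

For example, `split_fasta original.fasta 2 part` would write part_1.fasta and part_2.fasta. Round-robin balances the number of sequences per file, not the file sizes, so with genomes of very different lengths the parts can come out uneven.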

Ram · Answer 1 · 2014-05-23

2

11.2 years ago

Vivek ★ 2.7k

faSplit, from Jim Kent's UCSC utilities, is a suitable tool for the job.

Ram · Answer 2 · 2014-05-23

2

11.2 years ago

Caddymob ★ 1.0k

Check out bioawk from Heng Li, and the great tutorial on it from Vince Buffalo.

Ram · Answer 3 · 2014-05-23

1

11.2 years ago

Giovanni M Dall'Olio 28k

Try pyfasta:

pyfasta split -n 2 original.fasta
