Question

Pyfasta Split By Header

11.1 years ago

arronslacey ▴ 320

Hi, I am trying to split a FASTA file with pyfasta and write each sequence to its own file, but I'm having trouble understanding the syntax. I am using:

pyfasta split --header afp2test.fasta afp2test.fasta

This runs, but doesn't split the file.

What syntax do I need?

python • 6.4k views

ADD COMMENT • link updated 6.9 years ago by al-ash ▴ 210 • written 11.1 years ago by arronslacey ▴ 320

I guess you need to say what you want to split the FASTA by. From the pyfasta documentation:

Extract sequences from the file. Use the header flag to make a new FASTA file. The args are a list of sequences to extract:

$ pyfasta extract --header --fasta test/data/three_chrs.fasta seqa seqb seqc

Extract sequences from a file, using a file that lists the headers not wanted in the new file:

$ pyfasta extract --header --fasta input.fasta --exclude --file seqids_to_exclude.txt

Extract sequences from a FASTA file with complex keys, where we only want to look up based on the part before the space:

$ pyfasta extract --header --fasta input.with.keys.fasta --space --file seqids.txt
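
For readers who want to see the logic behind this kind of ID-based lookup, here is a minimal sketch in plain Python. It is an illustration only and not pyfasta's code: `filter_fasta` and its parameters are hypothetical names, and taking the ID as the header text before the first whitespace is an assumption that mirrors what `--space` is described as doing above.

```python
def filter_fasta(lines, ids, exclude=False):
    """Yield FASTA lines for records whose ID is in `ids`.

    Hypothetical helper, not part of pyfasta. The ID is assumed to be the
    header text before the first whitespace. With exclude=True the records
    whose ID is in `ids` are dropped instead of kept.
    """
    wanted = set(ids)
    keep = False  # lines before the first header are skipped
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            seqid = fields[0] if fields else ""
            # Keep the record if membership and the exclude flag disagree.
            keep = (seqid in wanted) != exclude
        if keep:
            yield line
```

In practice the IDs would be read from a file such as seqids.txt (one per line) and the result written to a new file with `writelines`.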

ADD REPLY • link 11.1 years ago by Phil S. ▴ 700

But is there a way to do this without having to write out all the headers on the command line? I have a file with hundreds of sequences and I just want to make a new file for each sequence. The "split" command lets you do this by specifying the number of files, but no matter what I do, the order does not seem to be preserved, and one file always contains two sequences. There is an option to split by header, and I thought this would automatically pick up the individual headers and put each sequence in its own file.

ADD REPLY • link 11.1 years ago by arronslacey ▴ 320

score 8 · Answer 1 · 2013-11-07

11.1 years ago

Manu Prestat 4.1k

GNU csplit is made for this kind of job:

csplit -z -q -n 4 -f sequence_ sequences.fasta '/>/' '{*}'

ADD COMMENT • link 11.1 years ago by Manu Prestat 4.1k

+1, using correct tool and looks nice!

ADD REPLY • link 11.1 years ago by Phil S. ▴ 700

score 1 · Answer 2 · 2013-11-07

11.1 years ago

Phil S. ▴ 700

this will do the job:

awk '/^>/ { if (file) close(file); file = sprintf("myseq%d.fa", n_seq++) } { print > file }' sequences.fa

It will generate one file (myseq0.fa, myseq1.fa, ...) for each of the sequences in your file 'sequences.fa'.

ADD COMMENT • link 11.1 years ago by Phil S. ▴ 700

score 0 · Answer 3 · 2018-01-09

See https://pypi.python.org/pypi/pyfasta/

Split the fasta file into one new file per header with "%(seqid)s" being filled into each filename:
$ pyfasta split --header "%(seqid)s.fasta" original.fasta

You need to tell pyfasta to use the sequence ID (the header of each record in the multi-FASTA file) as the name of each new file. That is what the %(seqid)s placeholder in the filename template is for.
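
If pyfasta is not available, the same one-file-per-header idea can be sketched in plain Python. This is a minimal illustration, not pyfasta's implementation: the function name `split_by_header` is hypothetical, and it assumes the sequence ID is the header text before the first whitespace. IDs containing characters that are not valid in file names (such as "/") would need sanitizing first, and duplicate IDs would overwrite each other.

```python
import os


def split_by_header(fasta_path, out_dir="."):
    """Write each record of a multi-FASTA file to <out_dir>/<seqid>.fasta.

    Hypothetical helper, not part of pyfasta. The seqid is assumed to be
    the header text before the first whitespace. Returns the paths of the
    files written, in input order.
    """
    written = []
    out = None
    try:
        with open(fasta_path) as handle:
            for line in handle:
                if line.startswith(">"):
                    # A new record starts: close the previous output file.
                    if out is not None:
                        out.close()
                    fields = line[1:].split()
                    seqid = fields[0] if fields else "unnamed"
                    path = os.path.join(out_dir, seqid + ".fasta")
                    out = open(path, "w")
                    written.append(path)
                # Lines before the first header are ignored.
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return written
```

Calling `split_by_header("original.fasta", "split_out")` would write one file per record into an existing directory named split_out; both names here are placeholders.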