Pyfasta Split By Header
3
0
11.1 years ago
arronslacey ▴ 320

Hi, I am trying to split a FASTA file using pyfasta and write each sequence to its own file, but I'm having some trouble understanding the syntax. I am using:

pyfasta split --header afp2test.fasta afp2test.fasta

This runs, but doesn't split the file.

What is the syntax I need?

python • 6.4k views
0

I guess you need to specify which sequences you want to pull out of the FASTA file:

Extract sequences from the file. Use the --header flag to make a new FASTA file; the arguments are the list of sequences to extract.

$ pyfasta extract --header --fasta test/data/three_chrs.fasta seqa seqb seqc

Extract sequences from a file, using a file that lists the headers not wanted in the new file:

$ pyfasta extract --header --fasta input.fasta --exclude --file seqids_to_exclude.txt

Extract sequences from a FASTA file with complex keys, where we only want to look up based on the part of the key before the space:

$ pyfasta extract --header --fasta input.with.keys.fasta --space --file seqids.txt
0

But is there a way to do this without having to write all the headers on the command line? I have a file with hundreds of sequences and I just want to make a new file for each sequence. The "split" command lets you do this by specifying the number of files, but no matter what I do with it, the order does not seem to be preserved, and one file always contains two sequences. There is an option to split by header, and I thought this would automatically pick up the individual headers and put each sequence in its own file.

8
11.1 years ago

GNU csplit is made for this kind of job:

csplit -z -q -n 4 -f sequence_ sequences.fasta '/^>/' '{*}'
0

+1, the right tool for the job, and it looks nice!

1
11.1 years ago
Phil S. ▴ 700

This will do the job:

 awk 'BEGIN {n_seq=0;} /^>/ {if(n_seq%1==0){file=sprintf("myseq%d.fa",n_seq);} print >> file; n_seq++; next;} { print >> file; }'   < sequences.fa

It will generate a file for each of the sequences in your file 'sequences.fa'.
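For anyone who would rather stay in Python, here is a rough sketch of the same one-file-per-header logic using only the standard library. The function name `split_fasta`, the `myseq` prefix default and the tiny demo input are all made up for illustration; this is not how pyfasta does it internally.

```python
import os
import tempfile


def split_fasta(path, out_dir, prefix="myseq"):
    """Write each record of the FASTA file at `path` to its own file in `out_dir`.

    Files are numbered in input order (myseq0.fa, myseq1.fa, ...),
    mirroring the naming used by the awk one-liner.
    """
    written = []
    out = None
    try:
        with open(path) as handle:
            for line in handle:
                if line.startswith(">"):
                    # A header line starts a new record: close the previous file.
                    if out is not None:
                        out.close()
                    name = os.path.join(out_dir, "%s%d.fa" % (prefix, len(written)))
                    out = open(name, "w")
                    written.append(name)
                # Lines before the first header (if any) are skipped.
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return written


# Small demo on a throwaway two-record file (hypothetical data).
demo_dir = tempfile.mkdtemp()
demo_src = os.path.join(demo_dir, "sequences.fa")
with open(demo_src, "w") as fh:
    fh.write(">seqa first record\nACGT\nAC\n>seqb\nGG\n")

files = split_fasta(demo_src, demo_dir)
print([os.path.basename(f) for f in files])
```

Only one output file is open at a time, so this should not hit the open-file limit even with many hundreds of sequences.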

0
6.9 years ago
al-ash ▴ 210

See https://pypi.python.org/pypi/pyfasta/

Split the FASTA file into one new file per header, with "%(seqid)s" being filled into each filename:
$ pyfasta split --header "%(seqid)s.fasta" original.fasta

You need to specify that the sequence ID (the name of each record within the multi-FASTA file) is used as the name of the new files. The seqid placeholder does that.
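To see what the placeholder does, note that "%(seqid)s" looks like Python's printf-style formatting with a named key. Assuming pyfasta fills the template that way (the syntax suggests it, but I have not checked the source), the filenames would come out as in this small sketch. The IDs seqa, seqb and seqc are just the example names from the extract command earlier in the thread.

```python
# Filename template as passed to --header on the command line.
template = "%(seqid)s.fasta"

# Example sequence IDs (illustrative only).
seqids = ["seqa", "seqb", "seqc"]

# Python's % operator with a mapping fills in the named placeholder.
filenames = [template % {"seqid": seqid} for seqid in seqids]
print(filenames)  # ['seqa.fasta', 'seqb.fasta', 'seqc.fasta']
```

Keep the template in quotes on the command line so the shell passes the parentheses through untouched.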
