Splitting A Fasta File
4
2
Entering edit mode
11.4 years ago
andysw90 ▴ 20

Hi all,

I have a FASTA file which contains protein sequences of a load of genes from D. melanogaster, and need to split the file into multiple FASTAs, one gene per file. What's the best way to go about this? Ideally each file will be named with the name of the gene (UniProt ID).

Example, the sequence below wants to be split into 2 files, the first called O46197.fasta, the second called Q9VUQ5.fasta

Many thanks!

>sp|O46197|A29AB_DROME Accessory gland protein Acp29AB OS=Drosophila melanogaster GN=Acp29AB PE=2 SV=2
MYASNLLYLLALWNLWDLSGGQQDIPNGKATLPSPQTPQNTIDQIGINQNYWFTYNALKQ
NETLAIIDTMEMRIASSLLEFKAQMEIQLQPLKIIMRHHASNIKASNNIKMRRFEKVGSR
HFHIEKNLMQTWFEAYVTCRKMNGHLANIQDEMELDGILALAPNNSYWIDISKLVENGGT
FVSTLTGREPFFVKWKSNQDTKKKNQCVYIYAKEMSYDECFEKKSFVCQADQWA

.
>sp|Q9VUQ5|AGO2_DROME Protein argonaute-2 OS=Drosophila melanogaster GN=AGO2 PE=1 SV=3
MGKKDKNKKGGQDSAAAPQPQQQQKQQQQRQQQPQQLQQPQQLQQPQQLQQPQQQQQQQP
HQQQQQSSRQQPSTSSGGSRASGFQQGGQQQKSQDAEGWTAQKKQGKQQVQGWTKQGQQG
GHQQGRQGQDGGYQQRPPGQQQGGHQQGRQGQEGGYQQRPPGQQQGGHQQGRQGQEGGYQ
QRPSGQQQGGHQQGRQGQEGGYQQRPPGQQQGGHQQGRQGQEGGYQQRPSGQQQGGHQQG
RQGQEGGYQQRPPGQQQGGHQQGRQGQEGGYQQRPPGQQQGGHEQGRQGQEGGYQQRPSG
QQQGGHQQGRQGQEGGYQQRPSGQQQGGHQQGRQGQEGGYQQRPSGQQQGGHQQGRQGQE
GGYQQRPPGQQPNQTQSQGQYQSRGPPQQQQAAPLPLPPQPAGSIKRGTIGKPGQVGINY
LDLDLSKMPSVAYHYDVKIMPERPKKFYRQAFEQFRVDQLGGAVLAYDGKASCYSVDKLP
LNSQNPEVTVTDRNGRTLRYTIEIKETGDSTIDLKSLTTYMNDRIFDKPMRAMQCVEVVL
ASPCHNKAIRVGRSFFKMSDPNNRHELDDGYEALVGLYQAFMLGDRPFLNVDISHKSFPI
SMPMIEYLERFSLKAKINNTTNLDYSRRFLEPFLRGINVVYTPPQSFQSAPRVYRVNGLS
RAPASSETFEHDGKKVTIASYFHSRNYPLKFPQLHCLNVGSSIKSILLPIELCSIEEGQA
LNRKDGATQVANMIKYAATSTNVRKRKIMNLLQYFQHNLDPTISRFGIRIANDFIVVSTR
VLSPPQVEYHSKRFTMVKNGSWRMDGMKFLEPKPKAHKCAVLYCDPRSGRKMNYTQLNDF
GNLIISQGKAVNISLDSDVTYRPFTDDERSLDTIFADLKRSQHDLAIVIIPQFRISYDTI
KQKAELQHGILTQCIKQFTVERKCNNQTIGNILLKINSKLNGINHKIKDDPRLPMMKNTM
YIGADVTHPSPDQREIPSVVGVAASHDPYGASYNMQYRLQRGALEEIEDMFSITLEHLRV
YKEYRNAYPDHIIYYRDGVSDGQFPKIKNEELRCIKQACDKVGCKPKICCVIVVKRHHTR
FFPSGDVTTSNKFNNVDPGTVVDRTIVHPNEMQFFMVSHQAIQGTAKPTRYNVIENTGNL
DIDLLQQLTYNLCHMFPRCNRSVSYPAPAYLAHLVAARGRVYLTGTNRFLDLKKEYAKRT
IVPEFMKKNPMYFV
fasta sequence • 20k views
ADD COMMENT
0
Entering edit mode
ADD REPLY
0
Entering edit mode

Duplicate of many previous questions; search this site for "split fasta".

ADD REPLY
10
Entering edit mode
11.4 years ago

try the following awk script:

awk -F '|' '/^>/ {F=sprintf("%s.fasta",$2); print > F;next;} {print >> F;}' < yourfile.fa
ADD COMMENT
1
Entering edit mode

Hi Pierre, your Awk script could be simpler (I like simplicity!): awk -F "|" '/^>/ {F = $2".fasta"} {print > F}' yourfile.fa It reads like this: if the line starts with >, open a new file and write to it until you reach a new line starting with >. Note it is not necessary to use >> with Awk (the output file is only opened once).

ADD REPLY
1
Entering edit mode

Hi Frédéric, both of these scripts work well, however I'm puzzled by how your solution works at all - Pierre's version has the next statement in it, but I'd expect yours to print just the names and not the sequence - but it does :D Do you care to explain where the magic happens please? I'm no expert on awk though, just an advanced copy-paster (meaning I can combine multiple copy-pastes to get stuff working :) ). Thanks

Anyway, this awk solution only gives me 1020 files (Pierre's version 1021) and then throws an error (Too many open files) while I have over 6000 contigs in my fasta. I can use bash split to divide this file into e.g. six smaller files with the default names and then use this to export the individual contigs, but it's less elegant and I want it to be usable by other people. Do you guys know of any simpler workaround?

What I actually want is to split a fasta file from our recent study into contigs with alignment for the 9 individuals and then put the solution on github with the rest of the analysis, for the sake of reproducible research ;)

ADD REPLY
2
Entering edit mode

Hi cicindel,

awk is trying to open a file descriptor for each fasta entry, and you are reaching the maximum number of files that can be opened simultaneously. To avoid that, you just need to close a file as soon as you are done with it. Here is what you need to know about awk's close() command:

close() will silently do nothing if given an argument that does not represent a file, pipe or coprocess that was opened with a redirection.

So, closing the previous file before opening a new one should do the trick. Use >> if you think a file might be opened more than once: awk -F "|" '/^>/ {close(F) ; F = $1".fasta"} {print >> F}' yourfile.fa

ADD REPLY
0
Entering edit mode

Oh cool, thanks - that seems to work exactly as I want it :)

ADD REPLY
0
Entering edit mode

What could I do if I want to split my multifasta file or 14000 accession numbers (">") to files with 2000 accession numbers each?

ADD REPLY
1
Entering edit mode

For that you could use the bash split command.

ADD REPLY
6
Entering edit mode
11.4 years ago
SES 8.6k

There is an EMBOSS program called seqretsplit that was designed exactly for this purpose. Also, there is a script distributed with BioPerl called bp_seqretsplit.pl that performs this same function and does not use EMBOSS. The BioPerl script is only a couple of lines of code and you can modify it easily to write out the type of identifiers you want. For example, this script will produce the desired output:

#!/usr/bin/env perl

use strict;
use warnings;
use Bio::SeqIO;
my $in = Bio::SeqIO->new(-format => 'fasta',
                         -fh   => \*ARGV);
while( my $s = $in->next_seq ) {
    my ($id) = ($s->id =~ /^(?:\w+)\|(\S+)\|/);
    Bio::SeqIO->new(-format => 'fasta',
                    -file   => ">".$id.".fasta")->write_seq($s);
}

Run it like:

$ perl biostars76716.pl < myseqs.fas

Note that there are some issues with the fasta you posted that do not look right, but I'll assume that is just an artifact of formatting your post.

ADD COMMENT
0
Entering edit mode

Top stuff, cheers

ADD REPLY
3
Entering edit mode
11.4 years ago
KCC ★ 4.1k

I thought I would write a python version. I stored the following text in a file called splitfasta.py

import sys
infile = open(sys.argv[1])
outfile = []

for line in infile:
    if line.startswith(">"):
        if (outfile != []): outfile.close()
        genename = line.strip().split('|')[1]
        filename = genename+".txt"
        outfile = open(filename,'w')
        outfile.write(line)
    else:
        outfile.write(line)
outfile.close()

The usage is: python splitfasta.py input.fa

ADD COMMENT
0
Entering edit mode

This is old, but which version of python did you use? With 2.7.2, 2.7.11, and 3.5.2 I get Traceback (most recent call last): File "splitfasta.py", line 13, in <module> outfile.write(line) AttributeError: 'list' object has no attribute 'write'

ADD REPLY
0
Entering edit mode
11.4 years ago

you may like to see couple of solution HERE

ADD COMMENT

Login before adding your answer.

Traffic: 1623 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6