Question

Split Large Fasta Into Multiple Files, Can't Name Them With GI Number

0

12.9 years ago

charles.bridges ▴ 70

I should start out by saying that I'm as new as it gets to both Python and Biopython. I'm trying to split a large .fasta file (with multiple entries) into single files, each with a single entry. I found most of the following code in the Biopython wiki Cookbook and adapted it just a bit. My problem is that this script names the files "1.fasta", "2.fasta", etc., and I need them named by some identifier such as the GI number.

def batch_iterator(iterator, batch_size) :
    """Returns lists of length batch_size.

    This can be used on any iterator, for example to batch up
    SeqRecord objects from Bio.SeqIO.parse(...), or to batch
    Alignment objects from Bio.AlignIO.parse(...), or simply
    lines from a file handle.

    This is a generator function, and it returns lists of the
    entries from the supplied iterator.  Each list will have
    batch_size entries, although the final list may be shorter.
    """
    entry = True #Make sure we loop once
    while entry :
        batch = []
        while len(batch) < batch_size :
            try :
                entry = next(iterator)
            except StopIteration :
                entry = None
            if entry is None :
                #End of file
                break
            batch.append(entry)
        if batch :
            yield batch

from Bio import SeqIO
infile = input('Which .fasta file would you like to open? ')
record_iter = SeqIO.parse(open(infile), "fasta")
for i, batch in enumerate(batch_iterator(record_iter, 1)) :
    outfile = "c:\python32\myfiles\%i.fasta" % (i+1)
    handle = open(outfile, "w")
    count = SeqIO.write(batch, handle, "fasta")
    handle.close()
    print ("Wrote %i records to %s" % (count, outfile))

If I try to replace:

outfile = "c:\python32\myfiles\%i.fasta" % (i+1)

with:

outfile = "c:\python32\myfiles\%s.fasta" % record_iter.id

so that each file is named after something like seq_record.id in SeqIO, I get the following error:

Traceback (most recent call last):
  File "C:\Python32\myscripts\generator.py", line 33, in <module>
    outfile = "c:\python32\myfiles\%s.fasta" % record_iter.id
AttributeError: 'generator' object has no attribute 'id'

Although the generator function has no attribute 'id', can I get around this somehow? Is this script too complicated for what I'm trying to do?!? Thanks, Charles

biopython fasta • 8.7k views

updated 12.9 years ago by Leonor Palmeira 3.9k • written 12.9 years ago by charles.bridges ▴ 70

0

If you want just one sequence per file, the batch function is overkill.
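
On the error itself: record_iter is the generator returned by SeqIO.parse, and only the individual records it yields have an .id. Each batch is a list of records, so the name has to come from an element of the batch. A minimal sketch of that idea, assuming a hypothetical Record namedtuple as a stand-in for SeqRecord (so it runs without Biopython) and a simplified batch_iterator built on itertools.islice rather than the Cookbook version:

```python
from collections import namedtuple
from itertools import islice

# Hypothetical stand-in for Bio.SeqRecord.SeqRecord, used only for illustration.
Record = namedtuple("Record", ["id", "seq"])


def batch_iterator(iterator, batch_size):
    """Yield lists of up to batch_size entries from any iterator."""
    iterator = iter(iterator)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def output_names(records, batch_size=1):
    """Return one file name per batch, taken from the first record's id."""
    names = []
    for batch in batch_iterator(records, batch_size):
        # batch is a list of records; the generator itself has no .id
        names.append("%s.fasta" % batch[0].id)
    return names


# Made-up records, just to show the naming
records = [Record("seqA", "ACGT"), Record("seqB", "GGCC")]
print(output_names(iter(records)))
```

With a batch size of 1, batch[0] is the only record in the list, which is why iterating over the records directly is simpler.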

12.9 years ago by Peter 6.0k

score 1 · Answer 1 · 2012-05-30

1

12.9 years ago

Niek De Klein ★ 2.6k

I'm not exactly sure what you're trying to do with batch_iterator, but if you just want to make a separate file for each fasta entry you could do:

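from Bio import SeqIO
infile = input('Which .fasta file would you like to open? ')
for record in SeqIO.parse(infile, "fasta"):
    # Name each output file after the record's ID
    outfile = open("c:\\python32\\myfiles\\" + record.id + ".fasta", "w")
    outfile.write(">" + record.description + "\n")
    # record.seq is a Seq object, so convert it to a string before writing
    outfile.write(str(record.seq) + "\n")
    outfile.close()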

Can I ask why you need a separate file for each fasta sequence?

0

You are missing a ">" for the FASTA output. How about this for shortness:

from Bio import SeqIO
infile = input('Which .fasta file would you like to open? ')
for record in SeqIO.parse(infile, "fasta"):
    SeqIO.write(record, record.id + ".fasta", "fasta")

This assumes the record ID doesn't contain any nasty characters which would be invalid in a filename.

12.9 years ago by Peter 6.0k

0

But of course the record ID contains invalid characters!! How could I go about replacing the | symbols with a filename-friendly character, such as - (dash) ?

12.9 years ago by charles.bridges ▴ 70

0

You can do record.id.replace(";","_"), where you have to change the ; to the illegal character. In your case that would be record.id.replace("|","-").
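
If more than one character might cause trouble, a single regular expression can handle them all at once. A minimal sketch, assuming the set of characters Windows rejects in file names; safe_filename is a hypothetical helper name and the GI-style ID is made up for illustration:

```python
import re

# Characters Windows does not allow in file names (assumed set for this sketch).
INVALID_CHARS = r'[\\/:*?"<>|]'


def safe_filename(record_id, replacement="-"):
    """Replace characters that are invalid in file names and trim stray separators."""
    cleaned = re.sub(INVALID_CHARS, replacement, record_id)
    # Drop replacement characters left at either end, e.g. from a trailing "|"
    return cleaned.strip(replacement)


# Made-up GI-style identifier
print(safe_filename("gi|12345|ref|NP_000001.1|") + ".fasta")
# prints gi-12345-ref-NP_000001.1.fasta
```

In the loop this would be used as SeqIO.write(record, safe_filename(record.id) + ".fasta", "fasta"). Note that two different IDs could map to the same file name after replacement, in which case the later record would overwrite the earlier file.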

12.9 years ago by Niek De Klein ★ 2.6k

score 1 · Answer 2 · 2012-05-30

1

12.9 years ago

Manu Prestat 4.1k

Install GenomeTools and use the command line:

gt splitfasta -splitdesc multifastafile.fa

score 0 · Answer 3 · 2012-05-31

From the Python Bioinformatics blog, you can make yourself a nice and complete script that you can then call simply by:

split_fasta.py bigfile.fasta /path/to/directory/where/to/put_the_split_files

Just put this file in a directory listed in your $PATH, and make it executable with:

chmod +x split_fasta.py

Here is the script:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import os
import sys
from optparse import OptionParser
from Bio import SeqIO

usage = "usage: %prog fasta_file_in directory_out"
parser = OptionParser(usage)
(opts, args) = parser.parse_args()
dir_out = os.getcwd()  # default: write into the current directory

if len(args) < 1:
    print("Error: Please enter at least one argument.")
    print("See program_name.py --help")
    sys.exit(1)
elif len(args) == 2:
    dir_out = args[1]
elif len(args) > 2:
    print("Error: Please enter up to 2 arguments.")
    print("See program_name.py --help")
    sys.exit(1)

file_in = args[0]

# Write each record to its own file, named after the record ID
for record in SeqIO.parse(file_in, "fasta"):
    f_out = os.path.join(dir_out, record.id + '.fasta')
    SeqIO.write(record, f_out, "fasta")