Split Large Fasta Into Multiple Files, Can't Name Them With GI Number
12.5 years ago

I should start out by saying that I'm as new as it gets to both Python and Biopython. I'm trying to split a large .fasta file (with multiple entries) into single files, each with a single entry. I found most of the following code on the Biopython wiki/Cookbook site and adapted it just a bit. My problem is that this generator names the files "1.fasta", "2.fasta", etc., and I need them named by some identifier such as the GI number.

def batch_iterator(iterator, batch_size):
    """Returns lists of length batch_size.

    This can be used on any iterator, for example to batch up
    SeqRecord objects from Bio.SeqIO.parse(...), or to batch
    Alignment objects from Bio.AlignIO.parse(...), or simply
    lines from a file handle.

    This is a generator function, and it returns lists of the
    entries from the supplied iterator.  Each list will have
    batch_size entries, although the final list may be shorter.
    """
    entry = True #Make sure we loop once
    while entry :
        batch = []
        while len(batch) < batch_size :
            try :
                entry = next(iterator)
            except StopIteration :
                entry = None
            if entry is None :
                #End of file
                break
            batch.append(entry)
        if batch :
            yield batch

from Bio import SeqIO
infile = input('Which .fasta file would you like to open? ')
record_iter = SeqIO.parse(open(infile), "fasta")
for i, batch in enumerate(batch_iterator(record_iter, 1)) :
    outfile = "c:\python32\myfiles\%i.fasta" % (i+1)
    handle = open(outfile, "w")
    count = SeqIO.write(batch, handle, "fasta")
    handle.close()
    print ("Wrote %i records to %s" % (count, outfile))

If I try to replace:

outfile = "c:\python32\myfiles\%i.fasta" % (i+1)

with:

 outfile = "c:\python32\myfiles\%s.fasta" % record_iter.id)

so that it will name something similar to seq_record.id in SeqIO, it gives the following error:

Traceback (most recent call last):
  File "C:\Python32\myscripts\generator.py", line 33, in <module>
    outfile = "c:\python32\myfiles\%s.fasta" % record_iter.id)
AttributeError: 'generator' object has no attribute 'id'

Although the generator function has no attribute 'id', can I get around this somehow? Is this script too complicated for what I'm trying to do?!? Thanks, Charles

biopython fasta • 8.4k views

If you want just one sequence per file, the batch function is overkill.
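For what it's worth, the error in the question comes from asking the generator itself for an ID. Each batch it yields is a list of records, so the ID lives on the items inside the batch. Below is a minimal sketch of that idea; `Record` and `batches` are hypothetical stand-ins (so it runs without Biopython), not the Cookbook code or the real SeqRecord class.

```python
from collections import namedtuple

# Hypothetical stand-in for Biopython's SeqRecord, used only so this
# sketch runs without Biopython installed.
Record = namedtuple("Record", ["id", "seq"])


def batches(iterator, batch_size):
    """Yield lists of up to batch_size items taken from iterator."""
    batch = []
    for entry in iterator:
        batch.append(entry)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Final, possibly shorter, batch
        yield batch


# In the real script this would be SeqIO.parse(infile, "fasta")
record_iter = iter([Record("seq1", "ACGT"), Record("seq2", "GGCC")])

names = []
for batch in batches(record_iter, 1):
    # batch is a list of records; record_iter (the generator) has no .id,
    # but the first record in the batch does.
    names.append("%s.fasta" % batch[0].id)

print(names)
```

With a batch size of 1, `batch[0].id` is the ID of the only record in each batch, which is why looping over the records directly is simpler.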

12.5 years ago
Niek De Klein ★ 2.6k

I'm not exactly sure what you're trying to do with batch_iterator, but if you just want to make a separate file for each fasta entry you could do:

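from Bio import SeqIO
infile = input('Which .fasta file would you like to open? ')
for record in SeqIO.parse(infile, "fasta"):
    # Open one output file per record, named after the record ID
    outfile = open("c:\\python32\\myfiles\\" + record.id + ".fasta", "w")
    outfile.write(">" + record.description + "\n")
    # record.seq is a Seq object, so convert it to a string before writing
    outfile.write(str(record.seq) + "\n")
    outfile.close()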

Can I ask why you need a separate file for each fasta sequence?


You are missing a ">" for the FASTA output. How about this for shortness:

from Bio import SeqIO
infile = input('Which .fasta file would you like to open? ')
for record in SeqIO.parse(infile, "fasta"):
    SeqIO.write(record, record.id + ".fasta", "fasta")

This assumes the record ID doesn't contain any nasty characters which would be invalid in a filename.


But of course the record ID contains invalid characters!! How could I go about replacing the | symbols with a filename-friendly character, such as - (dash) ?


You can do record.id.replace(";","_"), where you change the ";" to whichever illegal character you need to replace (in your case "|").
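If there are several problem characters, a regular expression can replace them all in one pass. This is only a sketch: the helper names `safe_filename` and `gi_number` are made up for illustration, the example ID is invented, and the GI extraction assumes headers of the form gi|number|db|accession|.

```python
import re


def safe_filename(record_id, replacement="-"):
    """Replace characters that are not allowed in Windows filenames."""
    return re.sub(r'[\\/:*?"<>|]', replacement, record_id)


def gi_number(record_id):
    """Return the GI number from an ID like 'gi|12345|ref|XX_000000.1|'.

    Returns None if the ID does not start with a 'gi' field.
    """
    parts = record_id.split("|")
    if len(parts) > 1 and parts[0] == "gi":
        return parts[1]
    return None


example_id = "gi|12345|ref|XX_000000.1|"  # invented ID, for illustration only
print(safe_filename(example_id))
print(gi_number(example_id))
```

In the loop above you would then build the output name from safe_filename(record.id) or gi_number(record.id) instead of record.id.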

12.5 years ago

Install GenomeTools and use the command line:

gt splitfasta -splitdesc multifastafile.fa
12.5 years ago

From the Python Bioinformatics blog, you can make yourself a nice and complete script that you can then call simply by:

split_fasta.py bigfile.fasta /path/to/directory/where/to/put_the_split_files

Just put this file in a directory on your $PATH and make it executable with:

chmod +x split_fasta.py

Here is the script:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
import sys
from optparse import OptionParser
from Bio import SeqIO

usage = "usage: %prog fasta_file_in directory_out"
parser = OptionParser(usage)
(opts, args) = parser.parse_args()
dir_out = os.getcwd()  # default: write into the current directory

if len(args) < 1:
    print("Error: Please enter at least one argument.")
    print("See program_name.py --help")
    sys.exit(1)
elif len(args) == 2:
    dir_out = args[1]
elif len(args) > 2:
    print("Error: Please enter up to 2 arguments.")
    print("See program_name.py --help")
    sys.exit(1)

file_in = args[0]

# Write each record to its own file, named after the record ID
for record in SeqIO.parse(file_in, "fasta"):
    f_out = os.path.join(dir_out, record.id + '.fasta')
    SeqIO.write([record], f_out, "fasta")

Thank you, Leonor
