Biopython 1.56 - Strange Behaviour Of Seqio.Parse And Seqio.Write
3
2
Entering edit mode
14.1 years ago
Markus ▴ 20

Hello everybody,

I'm using Biopython 1.56 compiled from source on Ubuntu 10.10 64-bit. It's a great piece of software and I love to work with it. But there is a very strange behavior of the "Bio.SeqIO.parse()" sequence parser, which cost me several hours to find out:

If I uncomment the for-statement with the print commands, "Bio.SeqIO.write()" refuses to write the sequences to the file.

Is this the desired behavior? Is the for-loop iterating the parser object to its end and leaves it there? Could anyone help me out? What am I getting wrong?

Thanks in advance!

Markus

Example:


handle = open("ls_orchid.fasta", "fasta")
parsedfasta = Bio.SeqIO.parse(handle, "fasta")

# If you uncomment the following part, 0 Sequences 
# (instead of many more) are written to the file!

#for seq_record in parsedfasta:   
    #print "Description:\t%s..." % seq_record.description
    #print "Length:\t\t%d" % len(seq_record)
    #print "Sequence:\t%s\n" % seq_record.seq[:40], seq_record.seq[44:50])

output_handle = open("example.fasta", "w")   
count = Bio.SeqIO.write(parsedfasta, output_handle, "fasta")
output_handle.close()
print "%d Sequences written to file" % count 
biopython python • 7.0k views
ADD COMMENT
12
Entering edit mode
14.1 years ago

Nice answers so far. SeqIO uses standard python iterators and generators:

http://www.learningpython.com/2009/02/23/iterators-iterables-and-generators-oh-my/

Using 'list' to turn an iterator into an in-memory list does allow you to parse it multiple times, but at the cost of needing to be able to fit the full file in memory. If you have a large file, or would just like to write your code in a more general manner, you can lazily move through the list of sequences a single time by doing the processing inside a generator function. This example processes each record then yields it for writing. Iterators and generators are very useful for dealing with large files in Python:

from Bio import SeqIO

handle = open("ls_orchid.fasta", "r")
parsedfasta = SeqIO.parse(handle, "fasta")

def print_seqrec(seq_record):
    """Specific function to analyze a sequence record.
    """
    print "Description:\t%s..." % seq_record.description
    print "Length:\t\t%d" % len(seq_record)
    print "Sequence:\t%s\n" % seq_record.seq[:40], seq_record.seq[44:50]

def process_and_generate(input_iterator, process_function):
    """Reusable function that processes a record, then generates each record.

    input_iterator is an iterator that returns one record at a time
    process_function is a function that takes one record and does some
      processing on it
    """
    for rec in input_iterator:
        process_function(rec)
        yield rec

generator = process_and_generate(parsedfasta, print_seqrec)

output_handle = open("example.fasta", "w")
count = SeqIO.write(generator, output_handle, "fasta")
output_handle.close()
print "%d Sequences written to file" % count
ADD COMMENT
0
Entering edit mode

Great answer! I'll try to integrate these ideas into my code. Thank you.

ADD REPLY
6
Entering edit mode
14.1 years ago
Casbon ★ 3.3k

Yes, you are consuming the generator with the for loop. Either call parse twice, or consume the generator in a list so that you can iterate over it more than once. This should work:

handle = open("ls_orchid.fasta", "fasta")
parsedfasta = list(Bio.SeqIO.parse(handle, "fasta"))

Notice we are using list to consume the iterator. For a large file this could take a lot of memory.

ADD COMMENT
0
Entering edit mode

Thanks a lot! It's nice to have some people around willing to help the helpless : )

ADD REPLY
0
Entering edit mode

open flag should be "r", not "fasta"

ADD REPLY
3
Entering edit mode
14.1 years ago
Thaman ★ 3.3k

The only thing you have to consider while parsing the result in case of Bio.SeqIO is:

  • Non-random access - Which means from Seq[0].......Seq[n] which is possbile just by iterator
  • Random access - Like your example which needs special built in python list function to turn the iterator into a list.
from Bio import SeqIO
handle = open("example.fasta", "rU")
records = list(SeqIO.parse(handle, "fasta"))
for seq_record in records:   
     print "Sequence:\t%s\n" % seq_record.seq[:40], seq_record.seq[44:50])
handle.close()

For more information visit SeqIO

Hope it will help

ADD COMMENT

Login before adding your answer.

Traffic: 1517 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6