Question

What is the correct syntax for BioPythons SeqIO.parse()

0

Entering edit mode

3.7 years ago

robert.murphy ▴ 110

When reading in an assembly with BioPython's SeqIO the tutorial indicated when reading in multiple records one should do the following:

records = list(SeqIO.parse("somefile.fasta", "fasta"))

This produces the expected behaviour of a subscriptable list of records. However this syntax also functions fine:

records = SeqIO.parse("somefile.fasta", 'fasta')

The only difference being that I cant subscript this object, but I can loop over it.

So, what is the benefit of one over the other?

Unless you intend to access a specific record by subscript why would you wrap the SeqIO.parse() as a list? Is it perhaps more memory efficient somehow?

assembly biopython • 6.1k views

ADD COMMENT • link updated 3.7 years ago by Philipp Bayer 8.8k • written 3.7 years ago by robert.murphy ▴ 110

score 3 · Accepted Answer · 2021-09-15

3

Entering edit mode

3.7 years ago

Philipp Bayer 8.8k

SeqIO.parse() returns an iterator, see here: https://biopython.org/docs/1.75/api/Bio.SeqIO.html#module-Bio.SeqIO

Iterators only return an item when they're called. If you run list() on an iterator Python will keep on pulling out of the iterator until nothing's left, returning a potentially memory-hogging list. You're correct in that it's more memory efficient to use iterators.

A few other standard methods switched from returning lists to returning iterators between Python2 and Python3, mostly for memory efficiency. zip([1,2,3], [4,5,6]) returns a (possibly huge) list in Python2 but an iterator in Python3 that doesn't take up memory space on its own.

ADD COMMENT • link 3.7 years ago by Philipp Bayer 8.8k

0

Entering edit mode

Thank you very much for the response. So to summarise if I am understanding correctly:

The method indicated in the BioPython tutorial to wrap in a list() is bad as it can result in very large lists especially if dealing with an faa of coding sequences for example?

Given your explanation, I am confused as to why they would suggest this method over simply just leaving the SeqIO.parse iterator object alone give the lack of benefits outside which to call a specific record?

ADD REPLY • link 3.7 years ago by robert.murphy ▴ 110

1

Entering edit mode

Yep! You can wrap SeqIO in a list if you're sure that your input file is relatively small, but I would normally not do it.

All of my SeqIO code looks like this:

for seq in SeqIO.parse('bigfile.fasta', 'fasta):
    print(seq.id)

You might want to wrap SeqIO.parse() into a list if you're interested in random access to sequences or direct access to specific sequences by name or position, but there are more efficient ways of doing that without loading the entire file into memory. In that case I would suggest to look at SeqIO.index() that works with fastas and will not load the entire file into memory until you run list() or to_dict() on it. SeqIO.index works like a dictionary and lets you pull out sequences by name. See here: https://biopython.org/docs/1.75/api/Bio.SeqIO.html#input-multiple-records

ADD REPLY • link 3.7 years ago by Philipp Bayer 8.8k