When reading in an assembly with BioPython's SeqIO the tutorial indicated when reading in multiple records one should do the following:
records = list(SeqIO.parse("somefile.fasta", "fasta"))
This produces the expected behaviour of a subscriptable list of records. However this syntax also functions fine:
records = SeqIO.parse("somefile.fasta", 'fasta')
The only difference being that I cant subscript this object, but I can loop over it.
So, what is the benefit of one over the other?
Unless you intend to access a specific record by subscript why would you wrap the SeqIO.parse()
as a list? Is it perhaps more memory efficient somehow?
Thank you very much for the response. So to summarise if I am understanding correctly:
The method indicated in the BioPython tutorial to wrap in a
list()
is bad as it can result in very large lists especially if dealing with anfaa
of coding sequences for example?Given your explanation, I am confused as to why they would suggest this method over simply just leaving the
SeqIO.parse
iterator object alone give the lack of benefits outside which to call a specific record?Yep! You can wrap SeqIO in a list if you're sure that your input file is relatively small, but I would normally not do it.
All of my SeqIO code looks like this:
You might want to wrap SeqIO.parse() into a list if you're interested in random access to sequences or direct access to specific sequences by name or position, but there are more efficient ways of doing that without loading the entire file into memory. In that case I would suggest to look at
SeqIO.index()
that works with fastas and will not load the entire file into memory until you run list() or to_dict() on it.SeqIO.index
works like a dictionary and lets you pull out sequences by name. See here: https://biopython.org/docs/1.75/api/Bio.SeqIO.html#input-multiple-records