I have downloaded a large file (~450 MB) of sequences in EMBL format from EBI. Now I want to build a Python dictionary that maps each sequence identifier to its length.
from Bio import SeqIO
length = {}
handle = open('seq1.embl', 'r')
for record in SeqIO.parse(handle, "embl"):
    length[record.id[:-2]] = len(record.seq)
print length
However this gives the following error message:
Traceback (most recent call last):
  File "length.py", line 4, in <module>
    for record in SeqIO.parse(handle, "embl"):
  File "/usr/local/lib/python2.7/dist-packages/Bio/SeqIO/__init__.py", line 537, in parse
    for r in i:
  File "/usr/local/lib/python2.7/dist-packages/Bio/GenBank/Scanner.py", line 445, in parse_records
    record = self.parse(handle, do_features)
  File "/usr/local/lib/python2.7/dist-packages/Bio/GenBank/Scanner.py", line 428, in parse
    if self.feed(handle, consumer, do_features):
  File "/usr/local/lib/python2.7/dist-packages/Bio/GenBank/Scanner.py", line 405, in feed
    misc_lines, sequence_string = self.parse_footer()
  File "/usr/local/lib/python2.7/dist-packages/Bio/GenBank/Scanner.py", line 558, in parse_footer
    or self.line.strip() == '//', repr(self.line)
AssertionError: 'XX'
This code works perfectly on another, smaller list of sequences in EMBL format. I tried Biopython 1.60 and 1.60+; both give the same error.
Pappu, it seems that there is at least one record in your file that Biopython cannot parse. I would print each record ID while iterating over the sequences. The last ID printed tells you roughly where in the file Biopython crashes: the failing record is the one that follows it. Then open the file in a text editor and inspect that record.
Thanks. For example, I get the same error when I try to parse this EMBL file in Biopython: http://www.ebi.ac.uk/ena/data/view/CM000771&display=txt&expanded=true