Question

Parsing List Of Sequences In EMBL Format

0

12.6 years ago

Pappu ★ 2.1k

I have downloaded a large file (~450 MB) of sequences in EMBL format from EBI. Now I want to build a Python dictionary that maps each sequence's identifier to its length.

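from Bio import SeqIO
length = {}
with open('seq1.embl', 'r') as handle:
    for record in SeqIO.parse(handle, "embl"):
        # record.id[:-2] drops the last two characters of the ID (presumably a version suffix such as ".1")
        length[record.id[:-2]] = len(record.seq)
print(length)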

However this gives the following error message:

Traceback (most recent call last):
  File "length.py", line 4, in <module>
    for record in SeqIO.parse(handle, "embl"):
  File "/usr/local/lib/python2.7/dist-packages/Bio/SeqIO/__init__.py", line 537, in parse
    for r in i:
  File "/usr/local/lib/python2.7/dist-packages/Bio/GenBank/Scanner.py", line 445, in parse_records
    record = self.parse(handle, do_features)
  File "/usr/local/lib/python2.7/dist-packages/Bio/GenBank/Scanner.py", line 428, in parse
    if self.feed(handle, consumer, do_features):
  File "/usr/local/lib/python2.7/dist-packages/Bio/GenBank/Scanner.py", line 405, in feed
    misc_lines, sequence_string = self.parse_footer()
  File "/usr/local/lib/python2.7/dist-packages/Bio/GenBank/Scanner.py", line 558, in parse_footer
    or self.line.strip() == '//', repr(self.line)
AssertionError: 'XX'

This code works perfectly on another, smaller list of sequences in EMBL format. I tried Biopython 1.60 and 1.60+; both give the same error.

python • 4.0k views

updated 12.6 years ago by Damian Kao 16k • written 12.6 years ago by Pappu ★ 2.1k

1

Pappu, it seems that there is at least one malformed record in your large file. I would print each record ID while iterating over the sequences; the last ID printed before the crash tells you roughly where in the file Biopython fails. Then open the file in a text editor and inspect the record that follows it.
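A minimal sketch of that idea, assuming the failure surfaces as an AssertionError (as in the traceback above) or a ValueError. The helper names `parse_with_progress` and `check_embl_file` are made up for illustration and are not part of Biopython:

```python
def parse_with_progress(records):
    """Consume an iterator of records, printing each ID as it is parsed.

    Returns (ids_parsed_so_far, error), where error is None if the
    iterator was exhausted without raising.
    """
    seen = []
    try:
        for record in records:
            print(record.id)
            seen.append(record.id)
    except (AssertionError, ValueError) as exc:
        # The record that failed is the one after seen[-1] in the file.
        return seen, exc
    return seen, None


def check_embl_file(path):
    # Imported here so the helper above can be used without Biopython installed.
    from Bio import SeqIO
    with open(path) as handle:
        return parse_with_progress(SeqIO.parse(handle, "embl"))


# Hypothetical usage with the file name from the question:
# ids, error = check_embl_file("seq1.embl")
# if error is not None:
#     print("Parsing failed after record:", ids[-1] if ids else None)
```

Searching the file for the last printed ID and looking at the next record should show the entry the parser could not handle.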

12.6 years ago by Andrzej Zielezinski 11k

0

Thanks. For example, I get the same error when I try to parse this EMBL file with Biopython: http://www.ebi.ac.uk/ena/data/view/CM000771&display=txt&expanded=true

12.6 years ago by Pappu ★ 2.1k

score 2 · Answer 1 · 2012-12-22

2

12.6 years ago

Damian Kao 16k

I think the problem is with the CO lines in the file. The Biopython EMBL parser expects the SQ and CO lines to come at the end of each record. Because your CO line comes before the FT lines, the parser did not find the terminator line (//) where it expected one; it found an XX line instead and raised the AssertionError.
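To see which records in a large file have this layout, a rough standard-library scan can help. This is only a sketch: it assumes each record starts with an ID line, ends with //, and that the two-letter line code sits in the first two columns. The function name `records_with_early_co` is made up for illustration.

```python
def records_with_early_co(lines):
    """Yield the ID of each EMBL record in which a CO line precedes an FT line."""
    record_id = None
    seen_co = False
    flagged = False
    for line in lines:
        code = line[:2]
        if code == "ID":
            # An ID line looks like "ID   CM000771; SV 1; ..."; keep the first field.
            record_id = line[2:].split(";")[0].strip()
            seen_co = False
            flagged = False
        elif code == "CO":
            seen_co = True
        elif code == "FT" and seen_co and not flagged:
            # Report each affected record once, at the first FT line after a CO line.
            flagged = True
            yield record_id
        elif code == "//":
            # End of record: reset state for the next one.
            record_id = None
            seen_co = False
            flagged = False


# Hypothetical usage with the file name from the question:
# with open("seq1.embl") as handle:
#     for rid in records_with_early_co(handle):
#         print(rid)
```

Because it reads the file line by line, the scan does not need to hold the 450 MB file in memory.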