Question

Failure to extract CDS from embl file using BioPython

0

5.4 years ago

rjqmantaring • 0

I'm pretty new to BioPython and I'm trying to use it to extract all of the CDS features from a .embl file. This is my code:

#!/usr/bin/python3.7

from Bio import SeqIO

for rec in SeqIO.parse("file.embl", "embl"):
    if rec.features:
        for feature in rec.features:
            # Report only coding sequences
            if feature.type == "CDS":
                print(feature.location)
                print(feature.qualifiers["protein_id"])
                print(feature.location.extract(rec).seq)

When I run my code I get the following error:

Traceback (most recent call last):
  File "extractor.py", line 5, in <module>
    record = SeqIO.read("file.embl", "embl")
  File "/usr/lib/python2.7/dist-packages/Bio/SeqIO/__init__.py", line 720, in read
    first = next(iterator)
  File "/usr/lib/python2.7/dist-packages/Bio/SeqIO/__init__.py", line 655, in parse
    for r in i:
  File "/usr/lib/python2.7/dist-packages/Bio/GenBank/Scanner.py", line 489, in parse_records
    record = self.parse(handle, do_features)
  File "/usr/lib/python2.7/dist-packages/Bio/GenBank/Scanner.py", line 473, in parse
    if self.feed(handle, consumer, do_features):
  File "/usr/lib/python2.7/dist-packages/Bio/GenBank/Scanner.py", line 440, in feed
    self._feed_first_line(consumer, self.line)
  File "/usr/lib/python2.7/dist-packages/Bio/GenBank/Scanner.py", line 661, in _feed_first_line
    raise ValueError('Did not recognise the ID line layout:\n' + line)
ValueError: Did not recognise the ID line layout:
ID                   file ; ; ; ; ; 29902 BP.

I can't seem to find any relevant documentation or forum post on that specific error message. Can anyone help me figure out what's going on?

Thanks in advance.

biopython • 1.6k views

ADD COMMENT • link 5.4 years ago by rjqmantaring • 0

0

Is this the first line of your file?

ID                   file ; ; ; ; ; 29902 BP.

Extracting more features from EMBL files with Biopython

Problem With Parsing Genome File - Embl Format - With Biopython

ADD REPLY • link 5.4 years ago by Fatima ▴ 1000

0

Yes. It's an EMBL file that was generated by transferring annotations from a GenBank file to an unannotated FASTA.

ADD REPLY • link 5.4 years ago by rjqmantaring • 0

1

The ID line should contain 2, 3, or 6 semicolons after the "ID" prefix; yours has 5, so the parser does not recognise the layout.

Here is the part of the script that generates the error:

def _feed_first_line(self, consumer, line):
        assert line[: self.HEADER_WIDTH].rstrip() == "ID"
        if line[self.HEADER_WIDTH :].count(";") == 6:
            # Looks like the semi colon separated style introduced in 2006
            self._feed_first_line_new(consumer, line)
        elif line[self.HEADER_WIDTH :].count(";") == 3:
            if line.rstrip().endswith(" SQ"):
                # EMBL-bank patent data
                self._feed_first_line_patents(consumer, line)
            else:
                # Looks like the pre 2006 style
                self._feed_first_line_old(consumer, line)
        elif line[self.HEADER_WIDTH :].count(";") == 2:
            # Looks like KIPO patent data
            self._feed_first_line_patents_kipo(consumer, line)
        else:
            raise ValueError("Did not recognise the ID line layout:\n" + line)
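
To check a file before handing it to SeqIO, the same semicolon count can be reproduced in plain Python. This is only a sketch: `HEADER_WIDTH = 5` is my assumption about the scanner's prefix width, the labels returned by `classify_id_line` are my own, and `pad_id_line` is a hypothetical repair that simply adds empty fields in front of the sequence length until the line has 7 fields (6 semicolons). Which field your tool actually left out is not known from the thread, so treat the padded line as a placeholder.

```python
HEADER_WIDTH = 5  # assumption: width of the "ID" prefix column in EMBL files


def count_id_semicolons(id_line):
    """Count semicolons after the header column, as the quoted check does."""
    return id_line[HEADER_WIDTH:].count(";")


def classify_id_line(id_line):
    """Return a label for the ID line layout, or None if unrecognised."""
    n = count_id_semicolons(id_line)
    if n == 6:
        return "new style (2006 onwards)"
    if n == 3:
        if id_line.rstrip().endswith(" SQ"):
            return "EMBL-bank patent"
        return "pre-2006 style"
    if n == 2:
        return "KIPO patent"
    return None


def pad_id_line(id_line):
    """Hypothetical repair: insert empty fields before the length until there are 7."""
    fields = [f.strip() for f in id_line[HEADER_WIDTH:].strip().split(";")]
    while len(fields) < 7:
        fields.insert(len(fields) - 1, "")
    return "ID   " + "; ".join(fields)


bad = "ID                   file ; ; ; ; ; 29902 BP."
print(count_id_semicolons(bad))   # number of semicolons found
print(classify_id_line(bad))      # None means the layout is not recognised
print(pad_id_line(bad))           # padded line with 6 semicolons
```

The padded line takes the 6-semicolon branch in the code above, but this sketch does not verify that the rest of the parser is happy with empty fields. The cleaner fix is to have the annotation-transfer tool write a complete ID line with real values.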