Failure to extract CDS from embl file using BioPython
0
0
Entering edit mode
4.7 years ago

I'm pretty new to BioPython and I'm trying to use it to extract all of the CDS features from a .embl file. This is my code:

#!/usr/bin/python3.7

for rec in SeqIO.parse("file.embl", "embl"):
if rec.features:
for feature in rec.features:
      if feature.type == "CDS":
            print(feature.location)
            print (feature.qualifiers["protein_id"])
            print (feature.location.extract(rec).seq)

When I run my code I get the following error:

Traceback (most recent call last):
File "extractor.py", line 5, in <module>
 record = SeqIO.read("file.embl", "embl")
File "/usr/lib/python2.7/dist-packages/Bio/SeqIO/__init__.py", line 720, in read
 first = next(iterator)
File "/usr/lib/python2.7/dist-packages/Bio/SeqIO/__init__.py", line 655, in parse
 for r in i:
File "/usr/lib/python2.7/dist-packages/Bio/GenBank/Scanner.py", line 489, in parse_records
 record = self.parse(handle, do_features)
File "/usr/lib/python2.7/dist-packages/Bio/GenBank/Scanner.py", line 473, in parse
 if self.feed(handle, consumer, do_features):
File "/usr/lib/python2.7/dist-packages/Bio/GenBank/Scanner.py", line 440, in feed
 self._feed_first_line(consumer, self.line)
File "/usr/lib/python2.7/dist-packages/Bio/GenBank/Scanner.py", line 661, in _feed_first_line
 raise ValueError('Did not recognise the ID line layout:\n' + line)
ValueError: Did not recognise the ID line layout:
ID                   file ; ; ; ; ; 29902 BP.

I can't seem to find any relevant documentation or forum post on that specific error message. Can anyone help me figure out what's going on?

Thanks in advance.

biopython • 1.3k views
ADD COMMENT
0
Entering edit mode
ADD REPLY
0
Entering edit mode

Yes. Its an embl file that as generated by transferring annotations from a GenBank file to an unannotated FASTA.

ADD REPLY
1
Entering edit mode

There should be 2 or 3 or 6 semicolons (there's 5 in your header).

Here is the part of the script that generates the error:

def _feed_first_line(self, consumer, line):
        assert line[: self.HEADER_WIDTH].rstrip() == "ID"
        if line[self.HEADER_WIDTH :].count(";") == 6:
            # Looks like the semi colon separated style introduced in 2006
            self._feed_first_line_new(consumer, line)
        elif line[self.HEADER_WIDTH :].count(";") == 3:
            if line.rstrip().endswith(" SQ"):
                # EMBL-bank patent data
                self._feed_first_line_patents(consumer, line)
            else:
                # Looks like the pre 2006 style
                self._feed_first_line_old(consumer, line)
        elif line[self.HEADER_WIDTH :].count(";") == 2:
            # Looks like KIKO patent data
            self._feed_first_line_patents_kipo(consumer, line)
        else:
            raise ValueError("Did not recognise the ID line layout:\n" + line)
ADD REPLY
0
Entering edit mode

thanks, I'll try looking into this.

ADD REPLY

Login before adding your answer.

Traffic: 2178 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6